
Dense machine code from C++ code (compiler optimizations)


Marcus

Nov 21, 2021, 12:13:13 PM
Just wrote this short post. Maybe someone finds it interesting...

https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/

/Marcus

Terje Mathisen

Nov 21, 2021, 4:44:30 PM
I didn't know that your target machine has bit field insert/extract
opcodes, otherwise the code generated was exactly as I expected. :-)

Terje
PS. Mill would get pretty much identical results, typically inlined so
no RET opcode.
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

BGB

Nov 21, 2021, 5:14:53 PM
On 11/21/2021 11:13 AM, Marcus wrote:
> Just wrote this short post. Maybe someone finds it interesting...
>
> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>

I guess this points out one limitation of my compiler (relative to GCC):
for many cases it does a fairly direct translation from C source to the
machine code.

It will not optimize any "high level" constructions, but instead sort of
depends on the programmer to write "reasonably efficient" C.

Such a case would not turn out nearly so nice in my compiler though (if
it supported C++), but alas.

Trying to port GCC looks like a pain though, as its codebase is pretty
hairy and it takes a fairly long time to rebuild from source (compared
with my compiler; which rebuilds in a few seconds).

Well, also my compiler can recompile Doom in ~ 2 seconds, whereas GCC
seemingly takes ~ 20 seconds to recompile Doom.



There are a few exceptions, e.g. it will recognize a few special cases for
memcpy and convert them into inline sequences or register moves, but in
most areas it is still "pretty dumb".
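
As a made-up illustration of the kind of case meant here (not BGBCC's
actual rule): a memcpy with a small, known constant size, say:

  struct vec3 { float x, y, z; } a, b;
  ...
  memcpy(&a, &b, sizeof(b));  /* small constant size */

can be expanded into a few plain load/store or register-move instructions
rather than an actual call to memcpy.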

Also doesn't perform function inlining or other high-level
transformations, and at its current defaults does not make use of strict
aliasing (default case is more conservative).

I have also noticed that its code generation is a little overzealous in its
handling of "volatile". Since it isn't always obvious which exact aspect
needs to be volatile, at present it turns anything that touches a volatile
variable into fairly obvious "Ld, Op, St" or "Ld, Ld, Op, St" sequences
(effectively disabling the use of the register allocator for those
expressions).

Otherwise, one would have to add some way to distinguish between:
The value of this variable is volatile;
Vs, the memory at the address pointed to by this pointer is volatile.
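
(C's declarator syntax can express this distinction at the type level; the
question is more about how finely the compiler tracks it internally:)

  volatile int *p;            /* the pointed-to int is volatile, 'p' is not */
  int * volatile q;           /* 'q' itself is volatile, the target is not  */
  volatile int * volatile r;  /* both are volatile                          */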

It does optimize a few minor things though, like:
unsigned int ui, uj;
...
uj = ui % 16;
Will get transformed into:
uj = ui & 15; //applies to power-of-2 and unsigned
...

Divide by constant may also be replaced with multiply-by-reciprocal and
similar (normally divide and modulo are implemented via function calls
into the C runtime).

So, internally:
c = a / b;
Maps to, say:
c = __sdivsi3(a, b); //no hardware divide operation
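
As a sketch of the multiply-by-reciprocal idea (the standard textbook form,
not necessarily the exact code BGBCC emits), an unsigned 32-bit divide by a
constant 10 can be done with a 64-bit multiply and a shift:

  #include <stdint.h>

  uint32_t div10(uint32_t x)
  {
      /* 0xCCCCCCCD == ceil(2^35 / 10), so (x * 0xCCCCCCCD) >> 35 == x / 10
         for all 32-bit x */
      return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
  }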


Also recently added optimizations which transform, eg:
c = a + 0;
d = 0 | a;
if(a==a) { ... } /* if 'a' is not floating-point */
...
Into:
c = a;
d = a;
if(1) { ... }
...

Did also recently add an optimization to reduce the number of internal
type promotion-conversions used for operators like "==" and similar (in
cases where doing so was unnecessary).

...



There is an optimization which eliminates stack-frame creation for small
leaf functions:
Can't call anything (including implicit runtime calls);
Can't access global variables (ABI reasons);
Can't use any operations which allocate temporary registers;
Limited to 10 variables (arguments + locals + temporaries);
Can't take the address of any variable;
...

This basically means that these functions can use:
Basic pointers and simple arrays;
Pointers to structs (but not value-type structs);
A basic set of ALU operators and casts.
Safe: +, -, &, |, ^, <<, >>
Sometimes: * (small types only), % (unsigned with small power-of-2)
Many other cases: *, /, % may use temporaries and/or calls.
Basic integer and primitive pointer types.
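
As a concrete (made-up) example, something along these lines ought to fit
within those limits and compile without a stack frame:

  int sum_u16(unsigned short *arr, int n)
  {
      int i, s;
      s = 0;
      for (i = 0; i < n; i++)
          s += arr[i];
      return s;
  }

Leaf function, no address-taken locals, no globals, only a handful of
variables, and only simple ALU ops and pointer/array access.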


But, then recently found/fixed a bug where it was sometimes still trying
to spill variables to the stack despite not having a stack-frame, which
was in turn corrupting the stack frame of the caller.

And, a bug where declaring an array like:
char *foo[] = { ... };
...
n=sizeof(foo);
Was giving the size as "sizeof(char *)", rather than the size the array
was initialized to (and currently still only works for global and static
arrays).

...

And, all this was while trying to hunt down another bug which seems to
result in Doom demos sometimes desyncing (in a way different from the
x86 builds); also a few other behavioral anomalies which have shown up
as bugs in Quake, ...


Well, also the relatively naive register allocation strategy doesn't help:
For each basic block, whenever a variable is referenced (that is not
part of the "statically reserved set"), it is loaded into a register
(and temporarily held there), and at the end of the basic-block,
everything is spilled back to the stack.

There are, in theory, much better ways to do register allocation.


Though, one useful case is that the register space is large enough to
where a non-trivial number of functions can use a "statically reserve
everything" special case. This can completely avoid spills, but only for
functions within a strict limit for the number of variables (limited by
the number of callee save registers, or ~ 12 variables with 32 GPRs).

For most functions, this case still involves the creation of a stack
frame and similar though (mostly to save/restore registers).

This case still excludes using a few features:
Use of structs as value types (structs may only be used as pointers);
Taking the address of any variable;
Use of VLAs or alloca;
...

But, this case does allow a few things:
Calling functions;
Operators which use scratch registers;
Accessing global variables;
...


With the expanded GPR space, the scope of the "statically assign
everything" case could be expanded, but I still haven't gotten around
to making BGBCC able to use all 64 GPRs directly (and I still don't
consider them to be part of the baseline ISA, ...). This could (in
theory) expand the limit to ~ 28 or so.


If BGBCC supported C++, I don't have much confidence for how it would
deal with templates.

But otherwise, it is mostly more trying to find and fix bugs and similar at
the moment. Often much of the effort goes into actually finding the bugs
(since "demo desyncs for some reason" isn't super obvious as to why it is
happening, or where in the codebase the bug is being triggered, ...).


> /Marcus

robf...@gmail.com

Nov 21, 2021, 5:55:08 PM
The cc64 compiler cheats and has a bit-slice operator that allows the
compiler to see directly when bit-field operations are needed.
A line like: "a[63:40] = b[23:0];" compiles into an extract and insert.

MitchAlsup

Nov 21, 2021, 6:27:37 PM
That is not cheating!
The semantics of the program have been obeyed.
>
{Although in that case, a single left shift would have sufficed.}
>
Although::
struct { uint64_t a: 20,
         filler: 19,
         c: 18; } t;
>
t.a=t.c; //does need an extract and an insert
>

BGB

Nov 22, 2021, 12:13:58 AM
A bit-slice operator could be useful, although BJX2 lacks explicit
bit-extract or bit-insert operations (they need to be built via shift
and mask).

It seems like both extract and insert could be generalized into a
combined "shift-and-mask" operator. Though, this would require a
mechanism to either supply or create the bit-mask.

Eg:
Rn=((Rm<<Ro)&Mask)|(Rn&(~Mask)).
Or (4R):
Rn=((Rm<<Ro)&Mask)|(Rp&(~Mask)).

So, if the Mask is all ones, it behaves as a normal shift; otherwise, the
clear bits of the mask select which bits are kept from the destination
register.

One could synthesize the mask from a shift and a count; the harder part is
coming up with a way to do so cheaply (when the shift unit is already in
use that cycle).


Could be done more simply via multiple ops:
SHLD.Q (left as-is)
BITSEL Rm, Ro, Rn // Rn=(Rm&Ro) | (Rn&(~Ro))
BITSEL Rm, Ro, Rp, Rn // Rn=(Rm&Ro) | (Rp&(~Ro))


So, extract is, say:
SHLD.Q R4, -24, R3
AND 511, R3
And, insert is, say:
MOV 511, R7
SHLD.Q R3, 52, R6 | SHLD.Q R7, 52, R7
BITSEL R6, R7, R8
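
In C terms, these amount to roughly the following (just a sketch, assuming
a 9-bit field, 64-bit registers, and a negative SHLD count meaning a right
shift):

  /* extract: 9-bit field starting at bit 24 */
  r3 = (r4 >> 24) & 511;

  /* insert: place a 9-bit value (already in range) at bit 52 */
  mask = 511ULL << 52;
  r8 = ((r3 << 52) & mask) | (r8 & ~mask);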


Still need to think about it.


Or such...

Ivan Godard

Nov 22, 2021, 1:06:34 AM
Consider also signed extract, and overflow-checking insert.

Marcus

Nov 22, 2021, 2:10:56 AM
On 2021-11-21 kl. 23:14, BGB wrote:
> On 11/21/2021 11:13 AM, Marcus wrote:
>> Just wrote this short post. Maybe someone finds it interesting...
>>
>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>
>
> I guess this points out one limitation of my compiler (relative to GCC)
> is that for many cases it does a fairly direct translation from C source
> to the machine code.
>
> It will not optimize any "high level" constructions, but instead sort of
> depends on the programmer to write "reasonably efficient" C.

I would suspect that. That was one of the points in my post: GCC and
Clang have many, many, many man-years of work built in, and it's very
hard to compete with them if you start fresh on a new compiler.

I also have a feeling that the C++ language is at a level today that
it's near impossible to write a new compiler from scratch. It's not only
about the sheer amount of language features (classes, lambdas,
templates, auto, constexpr, ...) and std library (STL, thread, chrono,
...), but it's also about the expectations about how the code is
optimized. C++ places a huge burden on the compiler to be able to
resolve lots of things at compile time (e.g. constexpr essentially
requires that the C++ code can be executed at compile time).

> Such a case would not turn out nearly so nice in my compiler though (if
> it supported C++), but alas.
>
> Trying to port GCC looks like a pain though, as its codebase is pretty
> hairy and it takes a fairly long time to rebuild from source (compared
> with my compiler; which rebuilds in a few seconds).

Yes, it has taken me years, and the code base and the build system is
not modern by a long shot. A complete rebuild of binutils +
bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X. An
incremental build of GCC when some part of the machine description has
changed (e.g. an insn description was added) takes about a minute.

OTOH it would probably have taken me even longer to create my own
compiler (especially as I'm not very versed in compiler architecture),
so for me it was the less evil of options (I still kind of regret that
I didn't try harder with LLVM/Clang, though, but I have no evidence
that the grass is greener over there).

>
> Well, also my compiler can recompile Doom in ~ 2 seconds, whereas GCC
> seemingly takes ~ 20 seconds to recompile Doom.
>

Parallel compilation. Using cmake + ninja the GCC/MRISC32 build time for
Doom is 1.3 s (10 s without parallel compilation). The build time for
Quake is 1.6 s (13 s without parallel compilation).

But I agree that a fast compiler is worth a lot. I work with a DSP
compiler that can take ~5 minutes to compile a single object file that
takes ~10 seconds to compile with GCC. It's a real productivity killer.

>

[snip]

> And, all this was while trying to hunt down another bug which seems to
> result in Doom demos sometimes desyncing (in a way different from the
> x86 builds); also a few other behavioral anomalies which have shown up
> as bugs in Quake, ...
>

FixedDiv and FixedMul are very sensitive in Doom. If you approximate the
16.16-bit fixed point operation (say, with 32-bit floating-point), or
get the result off by a single LSB, the demos will start running off
track very quickly ;-) I found this out back in the 1990s when I ported
Doom to the Amiga and tried to pull some 68020/030 optimization tricks.

Marcus

Nov 22, 2021, 2:29:50 AM
On 2021-11-21 22:44, Terje Mathisen wrote:
> Marcus wrote:
>> Just wrote this short post. Maybe someone finds it interesting...
>>
>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>
>> /Marcus
> I didn't know that your target machine has bit field insert/extract
> opcodes,

It originally didn't, but when I found the sweet mc88k bit field trick I
replaced the bit shift instructions with bit field instructions. :-)

The IBF (Insert Bit Field) instruction is the latest addition (got
implemented three weeks ago). My VHDL implementation is kind of a mess
though (it has evolved organically from plain shift operations), and
eats too much FPGA logic resources ATM. I'll have to revisit that some
time.

> otherwise the code generated was exactly as I expected. :-)
>
> Terje
> PS. Mill would get pretty much identical results, typically inlined so
> no RET opcode.

Well, for the article I decided to isolate the functions to make it
easier to follow, but otherwise you'd typically inline these kinds of
functions.

/Marcus

Ivan Godard

Nov 22, 2021, 3:06:30 AM
The rule of thumb twenty years ago was that a new production-grade
compiler cost $100M and five years. I doubt the cost has gone down. The
Mill tool chain, even using clang for front and middle end and not
including linking, is ~30k lines of pretty tight C++. That ain't cheap.

BGB

Nov 22, 2021, 4:40:15 AM
On 11/22/2021 1:10 AM, Marcus wrote:
> On 2021-11-21 kl. 23:14, BGB wrote:
>> On 11/21/2021 11:13 AM, Marcus wrote:
>>> Just wrote this short post. Maybe someone finds it interesting...
>>>
>>> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>>>
>>
>> I guess this points out one limitation of my compiler (relative to
>> GCC) is that for many cases it does a fairly direct translation from C
>> source to the machine code.
>>
>> It will not optimize any "high level" constructions, but instead sort
>> of depends on the programmer to write "reasonably efficient" C.
>
> I would suspect that. That was one of the points in my post: GCC and
> Clang have many, many, many man-years of work built in, and it's very
> hard to compete with them if you start fresh on a new compiler.
>
> I also have a feeling that the C++ language is at a level today that
> it's near impossible to write a new compiler from scratch. It's not only
> about the sheer amount of language features (classes, lambdas,
> templates, auto, constexpr, ...) and std library (STL, thread, chrono,
> ...), but it's also about the expectations about how the code is
> optimized. C++ places a huge burden on the compiler to be able to
> resolve lots of things at compile time (e.g. constexpr essentially
> requires that the C++ code can be executed at compile time).
>

I have currently put full C++ support on the "not going to do it" shelf...

There is a subset along similar lines to EC++, but hardly any C++ code
(which actually uses the STL or C++ standard library) would actually
work with such a subset.


>> Such a case would not turn out nearly so nice in my compiler though
>> (if it supported C++), but alas.
>>
>> Trying to port GCC looks like a pain though, as its codebase is pretty
>> hairy and it takes a fairly long time to rebuild from source (compared
>> with my compiler; which rebuilds in a few seconds).
>
> Yes, it has taken me years, and the code base and the build system is
> not modern by a long shot. A complete rebuild of binutils +
> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X. An
> incremental build of GCC when some part of the machine description has
> changed (e.g. an insn description was added) takes about a minute.
>

Rebuilding GCC for RISC-V took around 20 minutes on my 2700X, but
granted, this was on a platter drive which was running low on available
disk space; and on WSL.


Rebuilding BGBCC with MSVC takes around 3 seconds.
Rebuilding BGBCC with GCC takes around 56 seconds.


> OTOH it would probably have taken me even longer to create my own
> compiler (especially as I'm not very versed in compiler architecture),
> so for me it was the less evil of options (I still kind of regret that
> I didn't try harder with LLVM/Clang, though, but I have no evidence
> that the grass is greener over there).
>

I fiddled around with compilers and language design for a long time
before I got into ISA design or FPGAs (for me, I started fiddling with
writing language interpreters back in high-school; and BGBCC started out
as a fork off a project I started working on directly after high-school,
namely writing a custom JavaScript knock-off with the intention of using
it as a scripting language in my other projects).

Initially, BGBCC was also intended as a script-interpreter, just parsing
a C variant rather than a JS variant, but at the time I had found that C
was much worse suited to the task, and debugging a C compiler was
considerably harder than a JS-like compiler (despite their superficial
syntactic similarity).

But, then it beat around for some-odd years being used mostly as an "FFI
glue" tool.


I had BGBCC laying around from some past projects of mine. It was pretty
awful, and wasn't really used much for anything "non-trivial" but it
sorta had most of the basics in place.

For my newer ISA projects, I was like, "well, I will dust off what I
have and just use that".
Actual 2nd place option at the time was LCC, but it didn't look like LCC
would have offered that much over what I had already.

Though, BGBCC has expanded to around 5x as much code as it was when I
started this project. Most of this is in the backend, and a lot of new
special cases in the frontend, ...


It does have a little bit of funkiness in that it works very differently
from GCC. It does not use object files, but instead uses a
stack-oriented bytecode as an IR stage (and internally generates the 3AC
IR code from the stack-machine code).

...


>>
>> Well, also my compiler can recompile Doom in ~ 2 seconds, whereas GCC
>> seemingly takes ~ 20 seconds to recompile Doom.
>>
>
> Parallel compilation. Using cmake + ninja the GCC/MRISC32 build time for
> Doom is 1.3 s (10 s without parallel compilation). The build time for
> Quake is 1.6 s (13 s without parallel compilation).
>

I was invoking it via a similar approach to how one uses MSVC, namely:
gcc -o whatever ...all the source files... (options)

Generally, the compiler speed ranking tends to be:
MSVC, fastest;
BGBCC, 2nd fastest;
GCC and Clang, considerably slower.

Though, GCC and Clang are able to generate code that is typically a fair
bit faster than what MSVC can pull off.


BGBCC basically runs as a single thread in a single process.
It does cache loaded header files and similar in RAM, mostly because
opening/closing/reading files is fairly slow (and otherwise can become a
bottleneck).


> But I agree that a fast compiler is worth a lot. I work with a DSP
> compiler that can take ~5 minutes to compile a single object file that
> takes ~10 seconds to compile with GCC. It's a real productivity killer.
>

Hmm...

I get annoyed waiting much more than a few seconds.


>>
>
> [snip]
>
>> And, all this was while trying to hunt down another bug which seems to
>> result in Doom demos sometimes desyncing (in a way different from the
>> x86 builds); also a few other behavioral anomalies which have shown up
>> as bugs in Quake, ...
>>
>
> FixedDiv and FixedMul are very sensitive in Doom. If you approximate the
> 16.16-bit fixed point operation (say, with 32-bit floating-point), or
> get the result off by a single LSB, the demos will start running off
> track very quickly ;-) I found this out back in the 1990s when I ported
> Doom to the Amiga and tried to pull some 68020/030 optimization tricks.
>

I am handling these generally by using 64-bit integer math.
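
I.e., roughly the usual exact 64-bit forms (a sketch, not necessarily the
exact code; the engine's FixedDiv also clamps the result when the quotient
would overflow, which is omitted here):

  typedef int32_t fixed_t;   /* 16.16 fixed point */

  fixed_t FixedMul(fixed_t a, fixed_t b)
  {
      return (fixed_t)(((int64_t)a * b) >> 16);
  }

  fixed_t FixedDiv(fixed_t a, fixed_t b)   /* overflow clamp omitted */
  {
      return (fixed_t)(((int64_t)a << 16) / b);
  }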

Previously, I had the Doom demos 1:1 between the x86 builds and the BJX2
builds (they would play out the same way), but recently something has
changed which is causing the 3rd demo in the demo-loop (for Ultimate
Doom, E3M5) to diverge near the end.

Namely, in the x86 builds (and previously on BJX2), the player would ram
into an imp and then get killed by a hell baron near the outside edge of
the central structure. Now, on BJX2, there is a divergence and the
player gets killed (by the baron) in a different location.

The main difference in this case seems to be that the imp behaves
differently and is in a different spot.

I had already been poking a bit at the division algo and similar, and
this seems to still be generating the correct results (things like
sign-extensions on shifts and similar are also tested via "sanity
checks", ...).

Behavior seems to still match up in Heretic and Hexen last I checked.



I have at least found and fixed some other bugs though.

The ROTT demos also desync pretty bad.

The behavior is also fairly stable in the face of switching around
codegen options, ..., implying it is probably not a low-level codegen issue.

In any case, this implies a behavioral / semantics issue somewhere.

Though, it does appear that this desync matches up with the behavior I
saw when I built this port of the Doom engine for the RasPi.

Or:
The first demo (E1M5) is desync'ed, but behavior is consistent between
all the targets;
The second demo (E2M2) remains in-sync and consistent on all targets;
The third demo (E3M5) diverges when the imp and baron are encountered,
with my BJX2 build switching from the x86 behavior to ARM/RasPi behavior
(for reasons I have not yet determined).

...


But, yeah, this stuff is super fiddly (and sensitive to very slight
disturbances).

For ROTT, there also seems to be a running RNG state, and differences in
enemy behavior will also cause the RNG to diverge, leading to a cascade
effect where everything goes to crap.

This is unlike Doom and friends where enemy behavior does not make use
of an RNG (ROTT deals with demos by initially reseeding the RNG with 0).

Paul A. Clayton

Nov 22, 2021, 9:50:26 AM
On Monday, November 22, 2021 at 4:40:15 AM UTC-5, BGB wrote:
[snip]
> I get annoyed waiting much more than a few seconds.

No sword fighting?!
https://xkcd.com/303

Thomas Koenig

Nov 22, 2021, 10:54:21 AM
Marcus <m.de...@this.bitsnbites.eu> schrieb:

> A complete rebuild of binutils +
> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.

That is blindingly fast; it usually takes about half to three-quarters of
an hour with recent versions. Which version is your gcc port
based on?

Thomas Koenig

Nov 22, 2021, 12:47:03 PM
Ivan Godard <iv...@millcomputing.com> schrieb:

> The rule of thumb twenty years ago was that a new production-grade
> compiler cost $100M$ and five years. I doubt the cost has gone down.

That seems like a lot - 20 years ago, you could probably calculate with a
cost of less than $100K per programmer per year. That would
mean a 200+ strong team working on this full time.

Marcus

Nov 22, 2021, 12:50:19 PM
I'm on GCC trunk (12.0). Same with binutils and newlib.

I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
an NVMe drive (~3 GB/s read).

I build with "make -j28" (although the GNU toolchain build system is
poorly parallelizable, so most of the cores are idle most of the time).

BTW, my build script is here [1].

You may be experiencing the "Windows tax" [2] - i.e. Windows is almost
always slower than Linux (esp. when it comes to build systems).

/Marcus

[1] https://github.com/mrisc32/mrisc32-gnu-toolchain
[2] https://www.bitsnbites.eu/benchmarking-os-primitives

BGB

Nov 22, 2021, 1:50:56 PM
The current version of BGBCC is ~ 250k lines of C.
It was around 50k lines when I started.

Of this:
16k, C parser (includes preprocessor)
44k, middle stages (AST -> RIL, RIL -> 3AC, Typesystem, ...)
76k, BJX2 backend
20k, support code (memory manager, AST backend, ...)
40k, original SH4 / BJX1 backend;
30k, BSR1 backend
5k, Stuff for WAD2A and WAD4
...


ASTs (Parser, AST->RIL) use a system I had been calling BCCX, which is
organized in terms of nodes. In an abstract sense, a node contains:
A collection of key/value attributes;
A collection of zero or more child nodes.

BCCX-1 (Prior):
Supported 1-6 attributes directly;
Going beyond 6 attributes would use an array;
Nodes were organized via linked-lists.

BCCX-2 (Partial redesign):
Supports 1-8 attributes directly;
Going beyond 8 attributes causes the nodes to split B-Tree style;
Child nodes are organized via a radix-16 array.
Resembles a B-Tree being used as an array;
Children 0-3 may be folded into attributes.

Following the creation of BCCX-2, the nominal way to access child nodes
is to access them with an index (like an array). This interface was
back-ported to BCCX-1 (if needed).

BCCX-2 switches from the old clean up mechanism (manual destruction, and
clean-up via linked lists), to a mechanism inspired by Doom's Z_Malloc
system (sweeping away any short-lived nodes via Zone-tags, and
propagating tags along the tree if a node is moved into a longer-lived
zone).


Both versions of BCCX store attributes using 16-bit keys:
4 bits, identifies the type of the attribute
12 bits, gives a symbolic name for the attribute.

With most BCCX calls, one supplies both a name (as a string) and a
location to cache its index number (these need to match).

There are a few special node tags as well, namely:
$cdata / !CDATA, Represents a large blob of raw text.
$text / !TEXT , Represents a blob of bare text.
$list / !LIST , Represents a freestanding list of nodes.

The BCCX system is serialized as a certain likely familiar notation
(cough, XML).

ex:
int x, y;
y=x+1;
Might be parsed as if it were, say:
  <vars>
    <var name="x">
      <type name="int"/>
    </var>
    <var name="y">
      <type name="int"/>
    </var>
  </vars>
  <assign name="y">
    <value>
      <binary op="+">
        <left> <ref name="x"/> </left>
        <right> <int value="1"/> </right>
      </binary>
    </value>
  </assign>

Which is effectively a tree organization which goes back to the first
"BGBScript VM" which I implemented nearly 20 years ago (and which BGBCC
originally started as a fork off of). Such a design approach sort of
seemed to make sense back when I was much younger.


Though, BCCX-2 is several generations removed from where it started out,
and BCCX itself started originally as a wrapper interface over the
original DOM to make it "less awkward" for AST building tasks (with the
use of cached-integer calls being a latter addition to make performance
"less awful").

ASTs can be dumped as XML for debugging and, in premise, can be reloaded
from an XML dump.

I did experiment with a few alternate notations for dumping the AST, but
they didn't really save enough to make it worthwhile. There was also a
binary format for serializing the ASTs (loosely inspired by WBXML), but
it wasn't used for much (the ASTs aren't useful for much else, nor was
there much reason to load externally generated ASTs into the compiler).


In the front-end, there are some "infer" and "reduce" functions, which
can be used to simplify/rewrite expressions, and to figure out the type
of an expression, ...


Once it hits the middle stages:
Types may get translated to "signature strings";
These use a character-based notation to encode types;
Basic notation was inspired by the JVM and C++ name mangling.
The ASTs are translated to stack-machine operations (RIL).

So:
y=x+1;
Maps to, say:
PUSH_LOAD(x)
PUSH_INT(1)
BINOPST(ADD, y)

Many of the stack ops have a variant that can encode a destination, as
this allows making the 3AC translation "somewhat less awful".

At one point, the stack-machine also used an ASCII notation, but this
was later dropped in favor of using a binary format.

Once the RIL is produced, the ASTs are no longer needed and are discarded.


RIL may be stored out to files (and reloaded from files), but is
typically fed back in to the middle stage to generate 3AC. RIL is
generally also how BGBCC deals with static libraries. I had experimented
with trying to make the RIL IR 32/64 bit agnostic, but this is generally
problematic. It otherwise doesn't depend all that much on the target
machine (beyond what is "baked in" via the pre-processor).


When we reach the 3AC stage:
Types are represented as a bit-packed integer format;
Held inside a struct wrapper to avoid mixed-up assignment.
Variables are also represented as bit-packed integers;
Also held inside struct wrappers.
The 3AC operations are represented with structs;
Operations are represented using numerical constants;
...

This notation isn't quite SSA form, but is sort of a pseudo-SSA. It
marks versions of variables, though doesn't generally allow them to
overlap in time. Generally, a "phi" operator isn't used as I haven't
really been able to get it to work correctly (would need to figure out
how to emit "correct" phi operators from the RIL->3AC stage).

The variables are split into several numbering spaces:
R/T: Temporaries (12.12 bit space);
A: Arguments (12.12 bit space);
L: Locals (12.12 bit space);
G: Globals (24-bit space);
I: Immediate (encoded inline or via an index into a table);
S: String Literals;
...

The bit-packing scheme encodes embedded value types, which actually
takes up the majority of the bits in the format (based on 64 bit integer
values).

As can be noted, a full-width 64-bit or 128-bit constant can't be
encoded directly via this scheme, but can be mapped to table lookups
(128-bit constants are encoded via pairs of entries in the 64-bit
constant table). Most smaller integer constants and similar can be
encoded directly without needing to store them in the table.
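
As a rough sketch of the wrapper idea (illustrative only, not the exact
field layout):

  /* distinct struct types keep a packed type descriptor and a packed
     variable/register handle from being mixed up by accident */
  typedef struct { uint64_t val; } ccxl_type;      /* bit-packed type     */
  typedef struct { uint64_t val; } ccxl_register;  /* bit-packed variable */

  /* with bare uint64_t handles, assigning one to the other would silently
     "work"; with the wrappers it is a compile-time type error */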


In the backend, it is mostly about using a register allocator, and
mapping the 3AC ops to machine code.

The breakdown within the BJX2 backend seems to be:
  ~ 30k lines, stuff related to BJX2 machine code;
      Instruction emitters;
      An assembler (ASM parser, pipes into emitters);
        Used for inline ASM, and dealing with ASM code in general.
      WEXifier;
        Invoked after emitting machine code for a basic-block.
      Some alternate emitter logic for RISC-V machine instructions;
        (Partly mimics BJX2 ASM on the ASM level)
      ...
  ~ 40k lines, stuff for mapping 3AC operations to ASM level ops.
      Stack-frame management stuff;
      Register allocator;
      Call/Return and 'int' operators;
      ...
  ~ 3k lines, stuff for dealing with PE/COFF
  ...

The logic for managing the stack frame (load/store stack variables,
prolog/epilog sequences, ...) has ended up being one of the largest and
most complicated parts of the backend, followed by things like the
register allocator and similar.


Probably "would make sense to do eventually":
Split middle-section in half (make these stages independent):
AST -> RIL
RIL -> 3AC
Try to generalize RIL enough to make it "generally useful"
Say, more usable with alternate compiler implementations.
Maybe split the BJX2 backend in half:
3AC -> ASM (BJX2)
ASM -> BJX2 (Machine Code)
May still include PE/COFF output
Maybe add a COFF linker (and COFF emitter).


I have considered efforts for RIL cleanup (or to impose some semantic
constraints more like those in MSIL / CIL), but this is itself sort of a
battle. One big semantic difference between RIL and MSIL was "who
manages type coercion": In MSIL, this part was managed by the frontend,
whereas in RIL this part was mostly left up to the stack-machine.

To be useful for multiple source languages, one would generally want the
front-end to manage type coercion though, since not every language would
use C's type rules (and for alternate implementations, one wouldn't
necessarily want the RIL decoder to need access to C-style type
promotion logic).

So, for example, in MSIL, operators would have all 3 types being equal:
int + int -> int
long + long -> long
...

Whereas, in RIL, as-is, it is more like:
short + long -> long
char* + int -> char*
...

Other than this, its structure treats functions as more or less a flat
list of instructions, with labels and jumps thrown in; any block-scoped
variables are expected to be flattened into a single flat list; ...

Though, otherwise, RIL and MSIL are conceptually similar.

This is quite different from WASM though, which assumes high-level
control flow constructs and does not allow labels or goto.
Whereas, in RIL, "if(A op B)goto L;" is one of the primary control flow
operators (and the 3AC IR is primarily defined relative to these
label-and-goto based structures).
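
For example (a sketch of the general shape, not RIL's actual textual
form), a loop like:

  while(i<n) { s+=a[i]; i++; }

comes out as something like:

  L0:
    if(i>=n) goto L1;
    s=s+a[i];
    i=i+1;
    goto L0;
  L1: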



The ASM stage would not likely be "actual" textual ASM, but rather a
list of machine-instructions (as structs) which may be optionally
serialized to textual ASM, or parsed from textual ASM.

The idea would be for this to replace the current practice of the
code-generation functions invoking the emitters directly.


Though, it is a mystery if some of this stuff could be modified to be
more ISA agnostic. There was some partial work in this direction at one
point, but it didn't get very far.


...

MitchAlsup

Nov 22, 2021, 1:54:10 PM
In My 66000 ISA there are two 6-bit field codes: one defines the shift amount,
the other defines the width of the field (0->64). For immediate shift amounts
there is a 12-bit immediate that supplies the pair of 6-bit specifiers; for register
shift amounts R<5:0> is the shift amount while R<37:32> is the field width.
(The empty spaces are checked for significance)
<
Then there is the operand-by-operand select (CMOV) and the bit-by-bit
select (Multiplex):
CMOV:: Rd =(!!Rs1 & Rs2 )|(!Rs1 & Rs3 )
MUX:: Rd =( Rs1 & Rs2 )|(~Rs1 & Rs3 )
>
> Could synthesize a mask from a shift and count, harder part is coming up
> with a way to do so cheaply (when the shift unit is already in use this
> cycle).
<
It is a simple decoder........
>
>
> Could be done more simply via multiple ops:
> SHLD.Q (left as-is)
> BITSEL Rm, Ro, Rn // Rn=(Rm&Ro) | (Rn&(~Ro))
> BITSEL Rm, Ro, Rp, Rn // Rn=(Rm&Ro) | (Rp&(~Ro))
>
>
> So, extract is, say:
> SHLD.Q R4, -24, R3
> AND 511, R3
> And, insert is, say:
> MOV 511, R7
> SHLD.Q R3, 52, R6 | SHLD.Q R7, 52, R7
> BITSEL R6, R7, R8
>
>
> Still need to think about it.
<
Don't forget both signed and unsigned versions.
>
>
> Or such...

Thomas Koenig

Nov 22, 2021, 2:49:31 PM
Marcus <m.de...@this.bitsnbites.eu> schrieb:
> On 2021-11-22 16:54, Thomas Koenig wrote:
>> Marcus <m.de...@this.bitsnbites.eu> schrieb:
>>
>>> A complete rebuild of binutils +
>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>
>> That is blindingly fast, it usually takes about half to 3/4 on an
>> hour with recent versions. Which version is your gcc port
>> based on?
>>
>
> I'm on GCC trunk (12.0). Same with binutils and newlib.
>
> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
> an NVMe drive (~3 GB/s read).
>
> I build with "make -j28" (although the GNU toolchain build system is
> poorly parallelizable, so most of the cores are idle most of the time).
>
> BTW, my build script is here [1].

On a POWER 9 with "make -j32", I get


real 38m30.253s
user 420m38.816s
sys 7m9.639s

with Fortran, C++ and C enabled (and checking).

BGB

Nov 22, 2021, 3:06:07 PM
On 11/22/2021 11:50 AM, Marcus wrote:
> On 2021-11-22 16:54, Thomas Koenig wrote:
>> Marcus <m.de...@this.bitsnbites.eu> schrieb:
>>
>>> A complete rebuild of binutils +
>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>
>> That is blindingly fast, it usually takes about half to 3/4 on an
>> hour with recent versions.  Which version is your gcc port
>> based on?
>>
>
> I'm on GCC trunk (12.0). Same with binutils and newlib.
>
> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
> an NVMe drive (~3 GB/s read).
>

In my case, I am running Windows 10 on a Ryzen 2700X (8-core, 16-thread,
3.7 GHz).
OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD Red).

My MOBO does not have M.2 or similar, but does have SATA connectors, so
the SSD is plugged in via SATA.

RAM Stats:
48GB of RAM, 1467MHz (DDR4-2933)
192GB of Pagefile space.

HDD speed is pretty variable, but generally falls into the range of
20-80MB/s (except for lots of small files, where it might drop down to ~
2MB/s or so).


Also kinda funny is that "file and folder compression" tends to actually
make directories full of source code or other small files somewhat
faster (so I tend to leave it on by default for drives which I primarily
use for projects).

Otherwise, it is like the whole Windows Filesystem is built around the
assumption that one is primarily working with small numbers of large
files, rather than large numbers of small files.


> I build with "make -j28" (although the GNU toolchain build system is
> poorly parallelizable, so most of the cores are idle most of the time).
>

Trying to parallel make eats up all of the RAM and swap space, so I
don't generally do so. This issue is especially bad with LLVM though.


> BTW, my build script is here [1].
>
> You may experiencing the "Windows tax" [2] - i.e. Windows is almost
> always slower than Linux (esp. when it comes to build systems).
>

I suspect this is because GCC and similar do excessive amounts of file
opening and spawn huge numbers of short-lived processes. These operations
are well into the millisecond range.

This was enough of an issue in BGBCC that I added a cache to keep any
previously loaded headers in-RAM, since when compiling a program it
tends to access the same headers multiple times.

Well, and also it does everything in a single monolithic process.

I had considered possibly splitting the compiler into separately loaded
components (DLLs or SOs), but didn't do so mostly as it would add more
complexity than I felt was worthwhile (vs just recompiling the compiler
all at once).


As noted, in BGBCC, the frontends and backends are managed using
FOURCC's and an interface lookup system loosely inspired by how A/V
codecs worked in Windows (so it will lookup a backend with support for a
given target architecture given as a pair of FOURCC's, a frontend to
parse a given source language, ...).

Though, the division points are a little wonky:
The Frontend/Middle interface uses BCCX ASTs;
The Middle/Backend interface uses the 3AC IR;
...

RIL isn't used at this level because, at the time I designed this part,
I had considered RIL to be vestigial (part of the current compiler
structure here was due to there being a point where I expected it to get
dissolved and go away). However, I was left with an issue of needing a
way to have static-linked libraries, and it turns out that trying to
save and reload the 3AC IR "kinda really sucked" (much more hairy and
complicated than using the stack-machine IR for static libraries).



Though, compiling stuff in BGBCC with the "debug dump" option slows down
the compiler somewhat, as it is fairly slow to dump the preprocessor
output and ASTs for each translation unit. Most of this time is
seemingly spent on IO (opening files and writing out the output data).

Though, I am not sure how much of this is due to the overhead of the
antivirus software (which auto-scans every file as it is opened or written).

Ivan Godard

Nov 22, 2021, 4:08:40 PM
The encoding doesn't leave any room for expansion to 128 bit data?

MitchAlsup

Nov 22, 2021, 4:35:53 PM
The 12-bit immediate encoding does not, the register encoding does. The register
encoding can supply a 64-bit immediate with two 6-32 bit fields.
Since I have no registers larger than 64-bits, the issue is moot.
<

Ivan Godard

Nov 22, 2021, 4:37:24 PM
That's large by my standard; I suspect that a lot of that comes from use
of C instead of C++ as an implementation language - a judicious use of
templates *really* shrinks the source in things as regular as compilers.
Of course, the regularity of the target makes a large difference too -
and then there's commenting style. It makes it real hard to meaningfully
compare code sizes.

The tightest compiler I ever wrote was for Mary2 (an Algol68 variant).
The compiler could compile itself on a DG Nova 1200 in 64k memory - that
you shared with the OS. That compiler is how Mitch and I first met.

> ASTs (Parser, AST->RIL) use a system I had been calling BCCX, which is
> organized in terms of nodes. In an abstract sense, a node contains:
>   A collection of key/value attributes;
>   A collection of zero or more child nodes.

I avoid generating an AST at all; LL1 recursive descent makes it easy to
apply semantics during parse, so you go direct to IR.

> BCCX-1 (Prior):
>   Supported 1-6 attributes directly;
>   Going beyond 6 attributes would use an array;
>   Nodes were organized via linked-lists.

Linked structures are very costly as soon as the representation exceeds
the D1. For a production compiler I take a tip from the database world
and organize things in arrays, with sorts followed by linear passes.
Looks very much like the guts of a relational database doing joins.
There's a lot of lit on generating SSA, and LLVM is open source.

> The variables are split into several numbering spaces:
>   R/T: Temporaries (12.12 bit space);
>   A: Arguments (12.12 bit space);
>   L: Locals (12.12 bit space);
>   G: Globals (24-bit space);
>   I: Immediate (encoded inline or via an index into a table);
>   S: String Literals;
>   ...
>
> The bit-packing scheme encodes embedded value types, which actually
> takes up the majority of the bits in the format (based on 64 bit integer
> values).
>
> As can be noted, a full-width 64-bit or 128-bit constant can't be
> encoded directly via this scheme, but can be mapped to table lookups
> (128-bit constants are encoded via pairs of entries in the 64-bit
> constant table). Most smaller integer constants and similar can be
> encoded directly without needing to store them in the table.

One advantage to writing in C++ is the destructors are injected for you.

> In the backend, it is mostly about using a register allocator, and
> mapping the 3AC ops to machine code.
>
> The breakdown within the BJX2 backend seems to be:
>   ~ 30k lines, stuff related to BJX2 machine code;
>       Instruction emitters;
>       An assembler (ASM parser, pipes into emitters);
>         Used for inline ASM, and dealing with ASM code in general.
>       WEXifier;
>         Invoked after emitting machine code for a basic-block.
>       Some alternate emitter logic for RISC-V machine instructions;
>         (Partly mimics BJX2 ASM on the ASM level)
>       ...
>   ~ 40k lines, stuff for mapping 3AC operations to ASM level ops.
>       Stack-frame management stuff;
>       Register allocator;
>       Call/Return and 'int' operators;
>       ...
>   ~ 3k lines, stuff for dealing with PE/COFF
>   ...
>
> The logic for managing the stack frame (load/store stack variables,
> prolog/epilog sequences, ...) has ended up being one of the largest and
> most complicated parts of the backend, followed by things like the
> register allocator and similar.

When you co-develop the compiler and the architecture together then you
can push a lot of this off onto the architecture. Our chain would be
thrice the size if it were targeting a legacy architecture.
The need for an obligatory asm step disappeared with the first 1MB
hosts. Everybody emits your choice of text or binary or both from the
same IR.

> Though, it is a mystery if some of this stuff could be modified to be
> more ISA agnostic. There was some partial work in this direction at one
> point, but it didn't get very far.
>
>
> ...

What you have succeeded in doing - architecture, compiler, even the
postings here - is truly impressive.

BGB

Nov 22, 2021, 5:43:48 PM
I did go and add a bit-select instruction (BITSEL / MUX).

Currently I have:
CSELT // Rn = SR.T ? Rm : Ro;
PCSELT.L // Packed select, 32-bit words (SR.ST)
PCSELT.W // Packed select, 16-bit words (SR.PQRO)
BITSEL // Rn = (Rm & Ro) | (Rn & ~Ro);

BITSEL didn't add much cost, but did initially result in the core
failing timing.

Though, disabling the "MOV.C" instruction saved ~ 2k LUTs and made timing
work again.


The MOV.C instruction has been demoted some, as:
It isn't entirely free;
Its advantage in terms of performance is fairly small;
It doesn't seem to play well with TLB Miss interrupts.

Though, a cheaper option would be "MOV.C that only works with GBR and LR
and similar" (similar effect in practice, but avoids most of the cost of
the additional signal routing).


I am also on the fence and considering disallowing using bit-shift
operations in Lane 3, mostly as a possible way to reduce costs by not
having a (rarely used) Lane 3 shift unit.

Still on the fence though, as it does appear that shift operators in
Lane 3 aren't exactly unused either (so the change breaks compatibility
with my existing binaries).


>>
>> Could synthesize a mask from a shift and count, harder part is coming up
>> with a way to do so cheaply (when the shift unit is already in use this
>> cycle).
> <
> It is a simple decoder........


In theory, it maps nicely to a 12-bit lookup table, but a 12-bit lookup
table isn't super cheap.

One could have it as min/max values, allowing the problem to be reduced
to decisions in terms of the output bits.


Still would likely cost an annoying number of LUTs though (~ 200 LUTs by
current estimates). Though, still a lot cheaper than a dedicated shift unit.

I guess the open question is partly a choice between:
A 32-bit encoding which can specify a wide range of simple bit-masks;
Not doing anything, just sticking with the existing encodings (using a
96-bit encoding if need be).


>>
>>
>> Could be done more simply via multiple ops:
>> SHLD.Q (left as-is)
>> BITSEL Rm, Ro, Rn // Rn=(Rm&Ro) | (Rn&(~Ro))
>> BITSEL Rm, Ro, Rp, Rn // Rn=(Rm&Ro) | (Rp&(~Ro))
>>
>>
>> So, extract is, say:
>> SHLD.Q R4, -24, R3
>> AND 511, R3
>> And, insert is, say:
>> MOV 511, R7
>> SHLD.Q R3, 52, R6 | SHLD.Q R7, 52, R7
>> BITSEL R6, R7, R8
>>
>>
>> Still need to think about it.
> <
> Don't forget both signed and unsigned versions.

Signed extract can be done with 2 shifts.
For insert, there shouldn't be any practical difference between signed
and unsigned.


>>
>>
>> Or such...

MitchAlsup

Nov 22, 2021, 5:54:42 PM
On Monday, November 22, 2021 at 3:37:24 PM UTC-6, Ivan Godard wrote:
> On 11/22/2021 10:50 AM, BGB wrote:
> > On 11/22/2021 2:06 AM, Ivan Godard wrote:

> >
> > The logic for managing the stack frame (load/store stack variables,
> > prolog/epilog sequences, ...) has ended up being one of the largest and
> > most complicated parts of the backend, followed by things like the
> > register allocator and similar.
> When you co-develop the compiler and the architecture together then you
> can push a lot of this off onto the architecture. Our chain would be
> thrice the size it it were targeting a legacy architecture.
<
What to push into HW and what to push into SW is a careful balancing act.
<
One of the things I do in my simulators is that all of the executable
instructions are in a single file, and if you comment one (or more) of them
out, the simulator converts the working instruction into an INVALID. This
makes it easy to test whether the compiler has used/not-used any given
instruction.
<
I did succumb to putting code-density improving instructions into My 66000 ISA
particularly in the epilogue and prologue sections. Basically, if all you need is
a few temporary registers, the arguments, and a small local stack frame, then
you do this in SW as instructions. However, if you need registers saved/restored,
FP setup (or not), dynamic arrays, list of structures constructed,... Then ENTER
and EXIT help out code density immensely.

MitchAlsup

Nov 22, 2021, 6:04:13 PM
In general purpose codes, shifts are "not all that present", in the 2%-5% range
(in source code). In one 6-wide machine, we stuck an integer unit in each of the 6 slots
but we borrowed the shifters in the LD-Align stage for shifts--so only 3 slots
could perform shifts, while all 6 could do +-&|^ . MUL and DIV were done in the
multiplier (slot[3]).
<
Shifters are on the order of integer adders in gate count, and less useful.
>
> Still on the fence though as it does appear that shifts operators in
> Lane 3 aren't exactly unused either (so the change breaks compatibility
> with my existing binaries).
<
You should see what the cost is if only lanes[0..1] can perform shifts.
> >>
> >> Could synthesize a mask from a shift and count, harder part is coming up
> >> with a way to do so cheaply (when the shift unit is already in use this
> >> cycle).
> > <
> > It is a simple decoder........
> In theory, it maps nicely to a 12-bit lookup table, but a 12-bit lookup
> table isn't super cheap.
<
input.....|.............................output..................................
000000 | 0000000000000000000000000000000000000000000000000000000000000000
000001 | 0000000000000000000000000000000000000000000000000000000000000001
000010 | 0000000000000000000000000000000000000000000000000000000000000011
000011 | 0000000000000000000000000000000000000000000000000000000000000111
000100 | 0000000000000000000000000000000000000000000000000000000000001111
etc.
It is a straight "Greater Than" decoder.
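In C terms, the decoder amounts to something like (a sketch):

  /* mask with the n low bits set, for n in 0..64 */
  uint64_t mask = (n >= 64) ? ~0ull : ((1ull << n) - 1);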
>

BGB

Nov 22, 2021, 10:28:13 PM
Yeah, it is plain C, mostly because that is what I use.
For the most part, it is actually a C89/C95 style.

While in theory I could have used C++, at nearly every point, there was
something that ruined it for me:
Back in my early years, trying to use "g++" in Cygwin was basically
playing with a hand grenade (*1).

*1: The g++ compiler would blow up without warning and often nuke itself
in the process; and it was pretty hit or miss as to whether reinstalling
Cygwin would actually fix it, or for how long.


Later on, it was due to C++'s higher complexity interfering with the
ability to interface it with my own technology, and C++ not really
offering enough by itself to justify the added hassle.


Then one can note that it only compiles quickly if one pretends they are
basically using C with a few extra features; pretty much as soon as one so
much as looks at the STL, one's build times fall into the toilet.


>> ASTs (Parser, AST->RIL) use a system I had been calling BCCX, which is
>> organized in terms of nodes. In an abstract sense, a node contains:
>>    A collection of key/value attributes;
>>    A collection of zero or more child nodes.
>
> I avoid generating an AST at all; LL1 recursive descent makes it easy to
> apply semantics during parse, so you go direct to IR.
>

It is partly historical legacy.

But, yeah, I am using a recursive-descent parsing strategy in this case.


My first compiler/interpreter was based on Scheme, and parsed directly
to cons lists / S-Expressions. The interpreter would recursively walk
lists and 'eval'/'apply' the expressions.

This was based on a tagged pointer format (from memory):
Tagged pointers were 32 bits (this was on 32-bit x86);
Low 2 bits were used as a tag:
Pointer / Fixnum / Cons / Special
Cons cells were 8 bytes, encoded as a pair of pointers.
They would be linked linearly into lists.
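
Something like this, as a minimal sketch (the exact tag assignments here
are my own guess, the general idea is what matters):

  typedef uint32_t lref;       /* 32-bit tagged reference */

  #define TAG_PTR      0       /* heap object pointer            */
  #define TAG_FIXNUM   1       /* 30-bit immediate integer       */
  #define TAG_CONS     2       /* pointer to an 8-byte cons cell */
  #define TAG_SPECIAL  3       /* nil, booleans, EOL, ...        */

  #define TAGOF(r)       ((r) & 3)
  #define MKFIXNUM(n)    ((((uint32_t)(n)) << 2) | TAG_FIXNUM)
  #define FIXNUM_VAL(r)  (((int32_t)(r)) >> 2)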

Memory was primarily organized into cells managed by an allocation
bitmap, with a mark/sweep collector.


When I later implemented my first BGBScript VM, I went with XML and DOM.
This was because XML seemed to be fairly flexible, and I was generally
imitating JavaScript, ...

As before, the interpreter was based on a recursive tree walk, but with
DOM trees, at the time it was painfully slow (even fairly trivial
scripts were enough to make its slowness quite obvious).


The above is what was later forked to create BGBCC.

On the other side of things, BGBScript was later rewritten to reuse a
core based on the prior Scheme interpreter (and the parser would go
fairly directly from JS notation to Scheme S-Expressions). But, this
time around I used a stack-based bytecode interpreter (based around a
"while-switch" loop).

BGBCC had combined the XML based parser from the first VM, with the
bytecode format from the second VM (which makes it a fairly direct
ancestor of the current RIL bytecode). Originally, I had intended to use
it as another script interpreter, but this didn't work out so well.

Some things I had learned from BGBCC were back-ported to the BGBScript VM,
along with borrowing a bunch of stuff from ActionScript and moving the
core VM from being dynamically typed to statically typed. My first major
3D engine used this VM.


I then ended up creating a redesigned version of BGBScript which I had
called BGBScript 2; which switched the language over to being primarily
statically typed and using a more Java style syntax design (replacing
the ActionScript derived syntax of its predecessor). This version had
switched to a JSON derived model for the ASTs (in place of the Scheme
based core).

This VM was also built around the realization that I could still use a
stack-based bytecode but then dynamically translate it to a 3AC internal
form for the interpreter.


After this, I started on my BJX1 and later BJX2 projects, and hence
"knocking the dust off of BGBCC".

I make no claim that XML is a good basis for ASTs, but ultimately, that
is what I have in this case. Was at least able to make it "somewhat less
terrible". I also tried to incorporate some of the things I learned in
the BGBScript2 VM effort.

But, the result is partly that its codebase is kinda wonky.


I initially started writing BCCX2 as an attempt to hopefully move away
from the XML based abstract model, but this fizzled out when I noted
that doing so would require rewriting pretty much the entire compiler
frontend, whereas what I got was a lot less drastic (but did address
some of the worst of the performance and footprint issues with the prior
AST system).

Similarly, some other compilers had used bare structs or tagged unions
as AST nodes, which is pretty much the complete opposite approach from
trying to do so with XML nodes or similar.


It is also possible that an AST system could be built more on abstract
handles, predicates, and getter/setter functions, more like what I did
for many of the structures in the back-end (well, more so than it is
already).

Maybe replace the use of bare strings and int-pointers with accessing
attributes via opaque handles, and the use of node pointers with node
handles, ...



>> BCCX-1 (Prior):
>>    Supported 1-6 attributes directly;
>>    Going beyond 6 attributes would use an array;
>>    Nodes were organized via linked-lists.
>
> Linked structures are very costly as soon as the representation exceeds
> the D1. For a production compiler I take a tip from the database world
> and organize things in arrays, with sorts followed by linear passes.
> Looks very much like the guts of a relational database doing joins.
>

One of the big problems with the use of linked lists was that a node
could only be linked into a single tree at a time, and had to be
"cloned" to be reused in multiple sub-trees, or during
expression-rewriting tasks, which wasted considerable amounts of time
and memory.


I since have eliminated the use of linked lists from the ASTs, as noted,
in favor of a radix array structure.

So, for example:
1-level tree: 0-15 nodes
2-level tree: 16-255 nodes
3-level tree: 256-4095 nodes
4-level tree: 4096-65535 nodes
...

This structure also avoids the need for large contiguous memory arrays,
which are problematic for the highly volatile nature of AST
manipulations. Granted, access is O(log n) rather than O(1), but this
isn't a huge issue.
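
A sketch of the lookup in such a structure (illustrative; the real node
layout also has to deal with the attribute folding mentioned earlier):

  /* radix-16 tree: each level consumes 4 bits of the child index */
  struct node { struct node *child[16]; /* plus attributes, payload, ... */ };

  struct node *get_child(struct node *root, unsigned idx, int levels)
  {
      while (levels-- > 0)
          root = root->child[(idx >> (4 * levels)) & 15];
      return root;
  }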

The ability to reuse nodes in multiple sub-trees and sub-expressions
without cloning has significantly reduced the memory overhead of the
compiler's ASTs.


Using BCCX is sorta like:
  n=BCCX_NewCst(&bgbcc_rcst_type, "type");
  BCCX_SetCst(n, &bgbcc_rcst_name, "name", s1);
  BCCX_SetIntCst(n, &bgbcc_rcst_flags, "flags", fl);
  BCCX_SetIntCst(n, &bgbcc_rcst_ind, "ind", 0);
  ...
  n1=BGBCP_DefExpectType(ctx, &s);
  BCCX_AddV(n, BCCX_NewCst1(&bgbcc_rcst_super, "super", n1));
  ...

  if(BCCX_TagIsCstP(l, &bgbcc_rcst_getindex, "getindex") ||
     BCCX_TagIsCstP(l, &bgbcc_rcst_vector_ref, "vector-ref"))
  {
      ct=BCCX_FetchCst(l, &bgbcc_rcst_array, "array");
      cv=BCCX_FetchCst(l, &bgbcc_rcst_index, "index");
      ct=BGBCC_CCXL_ReduceExpr(ctx, ct);
      cv=BGBCC_CCXL_ReduceExpr(ctx, cv);
      ...
  ...

But, it was at least "high level" enough to where I was able to replace
the linked-list structures with array-like structures with only a modest
level of rewrite.

Side-note:
Pardon the use of 1 and 2 letter variable names...
There are a *lot* of those in BGBCC...
I was aware of LLVM, but it was large / complicated looking, and takes a
very long time to recompile, which was discouraging.

I am also not really a fan of the whole "Modern C++" thing...



>> The variables are split into several numbering spaces:
>>    R/T: Temporaries (12.12 bit space);
>>    A: Arguments (12.12 bit space);
>>    L: Locals (12.12 bit space);
>>    G: Globals (24-bit space);
>>    I: Immediate (encoded inline or via an index into a table);
>>    S: String Literals;
>>    ...
>>
>> The bit-packing scheme encodes embedded value types, which actually
>> takes up the majority of the bits in the format (based on 64 bit
>> integer values).
>>
>> As can be noted, a full-width 64-bit or 128-bit constant can't be
>> encoded directly via this scheme, but can be mapped to table lookups
>> (128-bit constants are encoded via pairs of entries in the 64-bit
>> constant table). Most smaller integer constants and similar can be
>> encoded directly without needing to store them in the table.
>
> One advantage to writing in C++ is the destructors are injected for you.
>

Most of this is wrapped, so code is like:
li=BGBCC_CCXL_GetRegImmLongValue(ctx, treg);

And the value of an immediate can be fetched regardless of its specific
storage. In this case, it fetches the value of a 'Long' immediate.


Then lots of type-predicates like:
if(BGBCC_CCXL_TypeQuadPointerP(ctx, type)) //*1
if(BGBCC_CCXL_TypeSgInt128P(ctx, type)) //*2

*1: 128-bit pointer (96 bit virtual address).
*2: 128-bit integer, either sign.


Pretty much the whole system is wrapper functions and predicates...

So, say, we have a "type", we can check its properties with various
predicates, or do transformations (de-referencing a type, taking
address-of-type, getting the return-value of a function type, ...).

Most of the internal bit-twiddly is hidden away.

There are actually several different bit patterns for the various fields
within the type:
A version that balances the size of the base type with the maximum array
bounds;
A version which uses a bigger array-size limit but smaller type field
and fewer levels of pointer indirection;
A version which has a small array field but bigger base type field (such
as for object pointers and function pointers);
A version that replaces the array size with a pointer class field (so,
can encode volatile/restrict/... pointers, but not arrays);
A version which overflows the type field into an index into a "type
overflow table".

Similarly for "register" handles, which are also somewhat bit-twiddled.
Lots of predicates and handler functions for these as well.


Then again, the main alternative (raw/exposed bit twiddly for all this)
would likely be far worse...
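For a rough idea of what this looks like in practice (the field widths and names below are made up for illustration; as noted, the real layouts vary and there are several of them):

typedef unsigned int ccxl_type;     /* opaque bit-packed type handle */

/* one illustrative layout: 12-bit base type, 3-bit pointer level,
   remaining bits as an array bound */
#define CCXL_TY_BASE(t)     ((t) & 0xFFF)
#define CCXL_TY_PTRLVL(t)   (((t) >> 12) & 0x7)
#define CCXL_TY_ARRSZ(t)    ((t) >> 15)

static int TypePointerP_Ex(ccxl_type t) { return CCXL_TY_PTRLVL(t) != 0; }
static int TypeArrayP_Ex  (ccxl_type t) { return CCXL_TY_ARRSZ(t)  != 0; }

/* de-reference: drop one level of pointer indirection */
static ccxl_type TypeDeref_Ex(ccxl_type t) { return t - (1u << 12); }

The callers only ever see the predicates and accessors, so the bit layouts can be (and have been) reshuffled without touching the rest of the compiler.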


Typically, the lifespan of all of this is tied to the "compiler
context", but as noted, some stuff recently is also tied to the "zone".
As-is, BJX2 is fairly minimal regarding what the hardware manages.
The compiler needs to figure out stack frame size and layout, which
registers to save, whether to reuse the prolog and epilog from prior
functions (or fold out the current prolog such that it can be reused), ...

There is also some amount of vestigial logic from the original ISAs
(SH4 and BJX1) from which the BJX2 backend was originally forked.
The idea here is that this would allow some amount of instruction-level
optimization that does not depend on needing to shuffle instructions
around in already emitted machine-code sequences.


>> Though, it is a mystery if some of this stuff could be modified to be
>> more ISA agnostic. There was some partial work in this direction at
>> one point, but it didn't get very far.
>>
>>
>> ...
>
> What you have succeeded in doing - architecture, compiler, even the
> postings here - is truly impressive.

Possibly...

BGB

unread,
Nov 23, 2021, 12:21:41 AM11/23/21
to
All 3 lanes still have other ALU ops, including some type-conversion and
packed-integer SIMD ops.

>>
>> Still on the fence though as it does appear that shifts operators in
>> Lane 3 aren't exactly unused either (so the change breaks compatibility
>> with my existing binaries).
> <
> You should see what the cost is if only lanes[0..1] can perform shifts.

Disabling the Lane 3 shifter seems to save ~ 1250 LUTs.
It also makes timing a little better.

I guess the main factor is whether it justifies the break in binary
compatibility (since previously, the WEXifier assumed that running
shifts in Lane 3 was allowed; and this seems to have occurred a non-zero
number of times in pretty much every binary I have looked at).


>>>>
>>>> Could synthesize a mask from a shift and count, harder part is coming up
>>>> with a way to do so cheaply (when the shift unit is already in use this
>>>> cycle).
>>> <
>>> It is a simple decoder........
>> In theory, it maps nicely to a 12-bit lookup table, but a 12-bit lookup
>> table isn't super cheap.
> <
> input.....|.............................output..................................
> 000000 | 0000000000000000000000000000000000000000000000000000000000000000
> 000001 | 0000000000000000000000000000000000000000000000000000000000000001
> 000010 | 0000000000000000000000000000000000000000000000000000000000000011
> 000011 | 0000000000000000000000000000000000000000000000000000000000000111
> 000100 | 0000000000000000000000000000000000000000000000000000000000001111
> etc.
> It is a straight "Greater Than" decoder.

One needs a greater-than comparison, a less-than comparison, and a way
to combine them with a bit-wise operator (AND or OR), for each bit.

Say, min, max:
min<=max: Combine with AND
min >max: Combine with OR

This would allow composing masks with a range of patterns, say:
00000000000003FF
7FE0000000000000
000003FFFFC00000
7FFE000000003FFF
...
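Behaviorally, the generator being discussed amounts to something like this (a software model only, one comparator pair per bit; in hardware all 64 bits would be evaluated in parallel):

#include <stdint.h>

uint64_t make_mask(unsigned lo, unsigned hi)
{
    uint64_t m = 0;
    unsigned i;
    for (i = 0; i < 64; i++) {
        int ge_lo = (i >= lo);      /* the "greater-or-equal min" compare */
        int le_hi = (i <= hi);      /* the "less-or-equal max" compare */
        int bit = (lo <= hi) ? (ge_lo && le_hi)     /* contiguous run */
                             : (ge_lo || le_hi);    /* wrapped mask */
        if (bit)
            m |= 1ULL << i;
    }
    return m;
}

So, make_mask(0, 9) gives 00000000000003FF, make_mask(53, 62) gives 7FE0000000000000, and the min>max case gives masks with ones at both ends.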


Granted, this isn't all that steep (vs the core as a whole), but I guess
it is more a question of whether it would be useful enough to make it worthwhile.


Though, would need to find an "odd corner" to shove it into.
Otherwise, BJX2 lacks any encodings for "OP Imm12, Rn".

Have spots for Imm10 (insufficient), and Imm16 (but no, not going to
spend an Imm16 spot on this; there are only a few of these left).

Have a few possible ideas (involving XGPR encodings, such as reclaiming
the encoding space for branches for Imm12 ops because XGPR+BRA does not
make sense), but they seem like ugly hacks.

Similarly, outside of being able to fit in a 32-bit encoding, it loses
any real advantage it might have had. Jumbo encodings can load pretty
much any possible constant.


>>
>

robf...@gmail.com

unread,
Nov 23, 2021, 1:54:51 AM11/23/21
to
I have put a significant portion of a man-year into my own compiler over the last 25+ years. Not
really production quality code, hobby quality with lots of ‘makes it work’ code. I think it is about
80k lines of code. A combination of C and C++. It started out as pure C. I have not yet used it for
anything but simple demos.

The compiler has limited support for some C++ isms like classes with methods and method
overloading. But does not support operator overloading, templates and many other things
associated with C++. The compiler makes use of two types of parse trees which would be
analogous to an AST. Expression nodes and statement nodes. As it generates code it builds a
linked list of code nodes. Optimizations are applied to all different types of nodes.

I got started working on it for a 68000 co-processor board for the XT and it has mutated through
several different processors. I started work before the internet and before I found out about other
projects like gcc. I decided to put more work into my own compiler that I already knew rather than
learn the workings of gcc.

The most recent addition to the compiler is support for co-routines. I think I can come up with a
way to support some recursion in co-routines, by invoking a co-routine again instead of just using
‘yield’ but I must think about it some more. I keep toying with the idea of trying to make it more
mainstream, perhaps by adding support for other mainstream processors. The compiler has
several nifty features like bit-slicing and enumerator generators.

For field extracts Thor supports using either a register or a seven-bit constant to specify the field
offset and width. Most Thor instructions are using a 48-bit encoding, allowing room for more options
when dealing with operands. One operand bit indicates a register or constant value. Virtually any
instruction can be a vector instruction as there is an opcode bit dedicated to indicating a vector
operation.
Thor should be extendable to 128-bits with a little bit of work. 128-bit constants can be formed
using the appropriate prefix instructions. The plan is to support 128-bit decimal floating-point.

Marcus

unread,
Nov 23, 2021, 3:08:18 AM11/23/21
to
On 2021-11-22 21:05, BGB wrote:
> On 11/22/2021 11:50 AM, Marcus wrote:
>> On 2021-11-22 16:54, Thomas Koenig wrote:
>>> Marcus <m.de...@this.bitsnbites.eu> schrieb:
>>>
>>>> A complete rebuild of binutils +
>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>>
>>> That is blindingly fast, it usually takes about half to 3/4 on an
>>> hour with recent versions.  Which version is your gcc port
>>> based on?
>>>
>>
>> I'm on GCC trunk (12.0). Same with binutils and newlib.
>>
>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
>> an NVMe drive (~3 GB/s read).
>>
>
> In my case, I am running Windows 10 on a Ryzen 2700X (8-core, 16-thread,
> 3.7 GHz).
>   OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
>   Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD Red).
>
> My MOBO does not have M.2 or similar, but does have SATA connectors, so
> the SSD is plugged in via SATA.
>
> RAM Stats:
>   48GB of RAM, 1467MHz (DDR4-2933)
>   192GB of Pagefile space.
>
> HDD speed is pretty variable, but generally falls into the range of
> 20-80MB/s (except for lots of small files, where it might drop down to ~
> 2MB/s or so).

I believe that Windows is particularly sensitive to slow/spinning
drives. My impression is that Linux is generally better at caching disk
data in RAM, which hides some of the slowness (esp. when working with
small files).

If you have spare PCIe slots (preferably PCIe 3.0 x4), you could always
look into using a PCIe M.2 adapter. That together with a 1 TB 3000 MB/s
drive should set you back less than $100. It's a game changer for
productivity.

>
> Also kinda funny is that "file and folder compression" tends to actually
> make directories full of source code or other small files somewhat
> faster (so I tend to leave it on by default for drives which I primarily
> use for projects).
>
> Otherwise, it is like the whole Windows Filesystem is built around the
> assumption that one is primarily working with small numbers of large
> files, rather than large numbers of small files.
>

Yes. Actually, it seems to be built around the assumption that you should
not access the file system at all if possible. Where you'd typically use
files for inter-process data communication in a Unix environment you are
often better off using direct IPC and in-memory data stores in Windows,
just to avoid the penalty of going via all of the file system layers.
You typically also want to use long-running persistent processes on
Windows, whereas in Linux or macOS it's fine to start thousands of
processes per second.

I think this is why tools that originate in Unix-like environments (e.g.
GCC, Git, CMake, bash, autotools, Apache, PostgreSQL, ...) always run
faster in Linux than in Windows - they were designed under the
assumption that file accesses and process creation are "free".

>
>> I build with "make -j28" (although the GNU toolchain build system is
>> poorly parallelizable, so most of the cores are idle most of the time).
>>
>
> Trying to parallel make eats up all of the RAM and swap space, so I
> don't generally do so. This issue is especially bad with LLVM though
Yeah, you have to have the RAM to support parallelism. You can always
try to use fewer processes though (e.g. -j2 or -j3). Also, avoid plain
"-j" as that just starts as many processes as possible without caring
about how many cores you have.

Unlike the GCC build system, the LLVM build system is near perfectly
parallelizable.

>
>
>> BTW, my build script is here [1].
>>
>> You may experiencing the "Windows tax" [2] - i.e. Windows is almost
>> always slower than Linux (esp. when it comes to build systems).
>>
>
> I suspect because GCC and similar do excessive amounts of file opening
> and spawning huge numbers of short-lived processes. These operations are
> well into the millisecond range.
>

Yep.

> This was enough of an issue in BGBCC that I added a cache to keep any
> previously loaded headers in-RAM, since when compiling a program it
> tends to access the same headers multiple times.
>
> Well, and also it does everything in a single monolithic process.
>

That's the Windows-first approach: Keep everything in one large
monolithic executable & process. It works, but I happen to like the more
modular approach of Unix-like systems, which makes it easy to leverage
existing tools (e.g. BuildCache [1]).

/Marcus

[1] https://github.com/mbitsnbites/buildcache


>
>

[snip]

>
>

Marcus

unread,
Nov 23, 2021, 4:16:48 AM11/23/21
to
I once had the opportunity to benchmark a relatively pricey POWER 9
system against an AMD 3950X.

IBM POWER 9:
- 2 sockets x 12 cores/socket x 4 threads/core = 96 HW threads
- 2.9 GHz under load (pretty high for a server CPU)

AMD 3950X:
- 1 socket x 16 cores/socket x 2 threads/core = 32 HW threads
- 3.9 GHz under load

I built LLVM using GCC, which saturated all cores, and to my surprise
the 3950X was faster than the POWER 9:

POWER 9: Build time 7m40s
3950X: Build time 6m45s

So I no longer consider POWER to be a real alternative, especially
given the price delta (the 3950X system cost less than $1500, the
POWER 9 system didn't).

/Marcus

BGB

unread,
Nov 23, 2021, 4:38:37 AM11/23/21
to
On 11/23/2021 12:54 AM, robf...@gmail.com wrote:
> I have put a significant portion of a man-year into my own compiler over the last 25+ years. Not
> really production quality code, hobby quality with lots of ‘makes it work’ code. I think it is about
> 80k lines of code. A combination of C and C++. It started out as pure C. I have not yet used it for
> anything but simple demos. The compiler has limited support for some C++ isms like classes
> with methods and method overloading. But does not support operator overloading, templates and
> many other things associated with C++. The compiler makes use of two types of parse trees
> which would be analogous to an AST. Expression nodes and statement nodes. As it generates code
> it builds a linked list of code nodes. Optimizations are applied to all different types of nodes.
> I got started working on it for a 68000 co-processor board for the XT and it has mutated through
> several different processors. I started work before the internet and before I found out about other
> projects like gcc. I decided to put more work into my own compiler that I already knew rather than
> learn the workings of gcc. The most recent addition to the compiler is support for co-routines. I
> think I can come up with a way to support some recursion in co-routines, by invoking a co-routine
> again instead of just using ‘yield’ but I must think about it some more. I keep toying with the idea
> of trying to make it more mainstream, perhaps by adding support for other mainstream processors.
> The compiler has several nifty features like bit-slicing and enumerator generators.
>

Yeah, BGBCC isn't quite that old.

My first Scheme interpreter was ~ 2002 or so, when I was in high school.

The first BGBScript VM was written ~ 2004, shortly after I graduated HS.
I did a partial rewrite of the VM ~ 2006 (where it was recombined with
the Scheme interpreter).

The BGBCC fork initially happened the 2007/2008 timeframe.
I got it sorta able to compile C, but it was kind of a fail at its
original purpose, and was difficult to debug, ... My younger self had
sorta underestimated the jump in compiler complexity from a JavaScript
style language to something like C.


It does support a C++ subset:
classes (single inheritance);
operator overloading;
namespaces;
...
But, does not support templates or similar.

C is generally a higher priority in my case.


BGBCC was then mostly dormant until ~ 2017, when my BJX1 project got
started. One of the first non-trivial programs I tried compiling with it
was Quake.

I have some videos on YouTube from that era (several years ago), eg:
https://www.youtube.com/watch?v=xe5T5rhvAtU
(At this time, Quake was still pretty broken).

Note that the only reason Quake ran at playable speeds was because the
BJX1 emulator was not trying to be cycle accurate, and could emulate
processors much faster than what is viable on an FPGA. A faster BJX2
emulator could be made, but would need to abandon things like
cycle-counting and modeling cache misses and similar.

The second major program I faced off with was Doom, which ended up
as the primary test program mostly because it was a lot easier to make
Doom run well at 50MHz than Quake (single-digit framerates are "teh
suck").


> For field extracts Thor supports using either a register or a seven-bit constant to specify the field
> offset and width. Most Thor instructions are using a 48-bit encoding, allowing room for more options
> when dealing with operands. One operand bit indicates a register or constant value. Virtually any
> instruction can be a vector instruction as there is an opcode bit dedicated to indicating a vector
> operation.
> Thor should be extendable to 128-bits with a little bit of work. 128-bit constants can be formed
> using the appropriate prefix instructions. The plan is to support 128-bit decimal floating-point.
>

OK. I am dealing with 64-bit registers, but any 128-bit operations
mostly exist by operating on registers in pairs.


No plans for Decimal FP as of yet.
Had tried to support a tweaked S.E15.M80 format, but it was "kinda
expensive" compared with an FPU that only does Double.


ADDX/SUBX: Technically two ADD operations in parallel with a carry flag
routed between the ALUs.

Had figured out a trick for Carry-Select adders which I will call
"handle the carry propagation first". If the adder is organized so that
the carry bits are determined first and then used to select the results
after the fact, then large adders can get a nice speedup.


ANDX/ORX/XORX: Technically, just the same operation being run in
parallel on 2 lanes.

SHADX/SHLDX: This got a little interesting. I had replaced the original
barrel shift units with funnel-shift units. These units could emulate
the original barrel shift ops, perform rotates, and (via slight
trickery) be ganged up to perform a 128-bit shift (it looks like a
128-bit shift, but is actually two 64-bit shifts running in parallel).

Ironically, these funnel shifters were actually cheaper than the
original barrel shifters.

So, each shifter might select two inputs (as a "high" and "low") half
(forming a 128 bit value), then produce an output using a sliding 64-bit
window spanning these two halves.

So, for example, 64-bit ops:
A SHL B: High Bits = Input, Low bits = Zero , Offset = 64-B
A SHR B: High Bits = SExt , Low bits = Input, Offset = B
A ROL B: High Bits = Input, Low bits = Input, Offset = 64-B
A ROR B: High Bits = Input, Low bits = Input, Offset = B
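In C terms the selection behaves roughly like this (a behavioral sketch; SHR is shown in its logical form, the arithmetic form would pass sign fill rather than zero as the high half):

#include <stdint.h>

/* take a 64-bit window from the 128-bit value {hi:lo}, 'offset' bits up */
static uint64_t funnel64(uint64_t hi, uint64_t lo, unsigned offset)
{
    if (offset == 0)  return lo;
    if (offset >= 64) return hi;
    return (hi << (64 - offset)) | (lo >> offset);
}

uint64_t shl64(uint64_t a, unsigned b) { return funnel64(a, 0, 64 - b); }
uint64_t shr64(uint64_t a, unsigned b) { return funnel64(0, a, b); }
uint64_t rol64(uint64_t a, unsigned b) { return funnel64(a, a, 64 - b); }
uint64_t ror64(uint64_t a, unsigned b) { return funnel64(a, a, b); }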

For 128-bit shifts, each shifter also needs access to the other half of
the input value. The logic is a little more complicated (depends on both
whether we are Lane1 or Lane2, Bit 6 of the shift, ...).


A SHLX B:
B[6]==0:
Lane 1: High Bits = InputHi, Low bits = InputLo, Offset = 64-B[5:0]
Lane 2: High Bits = InputLo, Low bits = Zero , Offset = 64-B[5:0]
B[6]==1:
Lane 1: High Bits = InputLo, Low bits = Zero , Offset = 64-B[5:0]
Lane 2: High Bits = Zero , Low bits = Zero , Offset = 64-B[5:0]

A SHRX B:
B[6]==0:
Lane 1: High Bits = SExt , Low bits = InputHi, Offset = B[5:0]
Lane 2: High Bits = InputHi, Low bits = InputLo, Offset = B[5:0]
B[6]==1:
Lane 1: High Bits = SExt , Low bits = SExt , Offset = B[5:0]
Lane 2: High Bits = SExt , Low bits = InputHi, Offset = B[5:0]

ROLX / RORX can be left as an exercise for the reader.

The actual scenario in my case is a little more complicated, but it is a
similar idea.

...

Marcus

unread,
Nov 23, 2021, 5:39:30 AM11/23/21
to
I'm curious: Do you have any public documentation of the Thor project?

BTW, in the 1990s Swedish SAAB Ericsson Space (later RUAG Space) had a
stack oriented RISC microprocessor called Thor. IIRC the stack based
approach aimed to minimize the use of on-chip memory cells (e.g. as
would be required by a register file), in order to reduce the risk of
bit flips due to ion radiation etc, which is common in space
environments (and I think that radiation hardened external RAM was a
thing).

You can still find some public documents, e.g:

- https://www.cse.chalmers.se/~johan/publications/Folkesson_FTCS28.pdf
- https://www.cse.chalmers.se/~johan/publications/ewdc-8.frm.pdf
- http://www.artes.uu.se/project/9811-4/nlft-statusrapport.990817.pdf
- https://dl.acm.org/doi/10.1145/126551.126564

I assume that the name Thor was chosen to reflect the Nordic heritage of
the company.

It was probably superseded by ERC32/LEON (radiation hardened SPARC).

/Marcus

robf...@gmail.com

unread,
Nov 23, 2021, 8:07:07 AM11/23/21
to
On Tuesday, November 23, 2021 at 5:39:30 AM UTC-5, Marcus wrote:
> On 2021-11-23 kl. 07:54, robf...@gmail.com wrote:
> I'm curious: Do you have any public documentation of the Thor project?
>
> /Marcus

The most recent version of the project (Thor2021) is publicly available at:
https://github.com/robfinch/Thor/tree/main/Thor2021

EricP

unread,
Nov 23, 2021, 9:10:31 AM11/23/21
to
Which SystemVerilog compiler do you use?

David Brown

unread,
Nov 23, 2021, 9:24:15 AM11/23/21
to
Large gcc builds can easily be two or three times faster on Linux than
on Windows, in my experience. gcc comes from a *nix heritage, where
spawning new processes is cheap and fast so a single run of "gcc" may
involve starting dozens of processes connected via pipes or temporary
files. On Windows, starting a new process is a slow matter, and using
temporary files often means the file has to actually be written out to
disk - you don't have the same tricks for telling the OS that the file
does not have to be created on disk unless you run out of spare memory.
So a Windows-friendly compiler will be a monolithic program that does
everything, while gcc is much more efficient on a *nix system. In
addition, handling lots of small files is vastly faster on *nix than
Windows. (Though I believe Win10 was an improvement here compared to Win7.)




Marcus

unread,
Nov 23, 2021, 9:25:27 AM11/23/21
to
Thanks!

That's impressive work. Great!

/Marcus

robf...@gmail.com

unread,
Nov 23, 2021, 9:48:34 AM11/23/21
to
robf...@gmail.com wrote:
>> On Tuesday, November 23, 2021 at 5:39:30 AM UTC-5, Marcus wrote:
>>> On 2021-11-23 kl. 07:54, robf...@gmail.com wrote:
>>> I'm curious: Do you have any public documentation of the Thor project?
>>>
>>> /Marcus
>>
>> The most recent version of the project (Thor2021) is publicly available at:
>> https://github.com/robfinch/Thor/tree/main/Thor2021
>>

>Which SystemVerilog compiler do you use?

The one provided in the Xilinx Vivado toolset.

Note to build the system there are a number of IP Catalog cores that need to be
setup. These would be provided in .XCI files but I've not transferred them to the
repository (yet).

Marcus

unread,
Nov 23, 2021, 10:33:48 AM11/23/21
to
Possibly (Win10 vs Win7), but it's still not even in the same ballpark
as Linux:

https://www.bitsnbites.eu/benchmarking-os-primitives/

File creation is ~60x slower on Windows 10 compared to Linux (highly
dependent on things like Defender and indexing services though), and
launching processes is ~20x slower (in the tests in that article).

Thomas Koenig

unread,
Nov 23, 2021, 12:18:08 PM11/23/21
to
An AMD 3950 is one generation ahead of a POWER9, so it stands to
reason that it is faster.

For CFD, there was a time window a couple of years ago when the
Talos II machines actually beat the performance of x86 chips,
so a colleague actually bought a POWER9 system for that application.
The program in question was OpenFOAM.

Then AMD added memory channels, and the performance advantage relative
to the newest generation was gone :-)

(POWER seems to be very good ad attaching GPUs, but OpenFOAM doesn't use
that, and anyway GPUs are of little help for CFD if the memory and
memory bandwith requirements are high).

George Neuner

unread,
Nov 23, 2021, 12:38:51 PM11/23/21
to
On Tue, 23 Nov 2021 09:08:14 +0100, Marcus
<m.de...@this.bitsnbites.eu> wrote:

>I believe that Windows is particularly sensitive to slow/spinning
>drives. My impression is that Linux is generally better at caching disk
>data in RAM, which hides some of the slowness (esp. when working with
>small files).

The server edition of Windows behaves more like Linux.

The workstation edition by default has a /per-application/ limit on
the amount of data that can be cached. Even if there is more space
available, once an application reaches the limit it will start to
cache thrash. This is independent of the program file handle limit.


[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management]
"LargeSystemCache"=dword:00000000

Set to 1 to enable, 0 to disable. Note that LargeSystemCache disables
per-app data limits, so an application that uses huge amounts of file
data potentially can take up to the whole cache.


This little app can set/limit the total size of the file cache (not
the per-app limits). By default Windows will use up to ~90% of free
memory.
https://www.microimages.com/downloads/SetFileCacheSize.htm

Note: this need Admin privileges, and although you can change the
cache size on the fly, the settings will NOT survive OS restart. If
you find it useful, you should schedule it as a task to set your
chosen cache size at boot.


I'm sure there is a way to change the per-app data limits, but
unfortunately I don't know the magic incantations to search for.

HP used to maintain a comprehensive listing of Windows registry
settings, but unfortunately I lost the link and Google is not being
cooperative just now. Sigh!

George

MitchAlsup

unread,
Nov 23, 2021, 1:44:33 PM11/23/21
to
Up to this point you need a Greater than decoder and a less than decoder
and an AND gate.
<
> 7FFE000000003FFF
> ...
Why split the field around the container boundary ??

Terje Mathisen

unread,
Nov 23, 2021, 3:21:37 PM11/23/21
to
Marcus wrote:
> On 2021-11-23 15:24, David Brown wrote:
>> On 22/11/2021 21:05, BGB wrote:
>>> On 11/22/2021 11:50 AM, Marcus wrote:
>>>> On 2021-11-22 16:54, Thomas Koenig wrote:
>>>>> Marcus <m.de...@this.bitsnbites.eu> schrieb:
>>>>>
>>>>>> A complete rebuild of binutils +
>>>>>> bootstrap GCC + newlib + GCC takes about 8 minutes on my 3900X.
>>>>>
>>>>> That is blindingly fast, it usually takes about half to 3/4 on an
>>>>> hour with recent versions.  Which version is your gcc port
>>>>> based on?
>>>>>
>>>>
>>>> I'm on GCC trunk (12.0). Same with binutils and newlib.
>>>>
>>>> I use Ubuntu 20.04, and the HW is a 3900X (12-core/24-thread) CPU, with
>>>> an NVMe drive (~3 GB/s read).
>>>>
>>>
>>> In my case, I am running Windows 10 on a Ryzen 2700X (8-core, 16-thread,
>>> 3.7 GHz).
>>>  Â  OS + Swap is on a 2.5" SSD ( ~ 300 MB/s )
>>>  Â  Rest of storage is HDDs, mostly 5400 RPM drives (WD Green and WD
>>> Red).
>>>
>>> My MOBO does not have M.2 or similar, but does have SATA connectors, so
>>> the SSD is plugged in via SATA.
>>>
>>> RAM Stats:
>>>  Â  48GB of RAM, 1467MHz (DDR4-2933)
>>>  Â  192GB of Pagefile space.
It turns out that a group of people including Robert Collins, a
co-worker (at Cognite), has done a lot of work on speeding up Rust
installations on both unix/linux and Microsoft platforms. They found
that Windows can in fact reach parity with Linux, but with serious effort:

https://www.youtube.com/watch?v=qbKGw8MQ0i8

It turns out that the actual throughput is approximately the same for
Windows/NTFS and Linux, but as you noted Defender & co makes the latency
orders of magnitude worse. By overlapping pretty much all the file
metadata work (mainly file creations) those latency bubbles can in fact
be filled in. Afair they reduced the installation time from several
minutes to ~10 seconds.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

unread,
Nov 23, 2021, 6:10:28 PM11/23/21
to
Now if MS would simply allow one to DD <current image> -> <new image>
so that recovery was 6 minutes instead of 6 hours we might be getting
somewhere...

BGB

unread,
Nov 23, 2021, 6:20:15 PM11/23/21
to
Masks like this happen sometimes, and supporting both an AND gate and an
OR gate would not likely cost that much more than the AND gate by itself.


Goes and runs some stats:
Constants that fit these patterns: ~ 30% of the total constants;
Percentage of masks which end up using jumbo 64 or 96: ~ 6%
Percentage of total constant loads, masks as jumbo: ~ 2%
Percentage of total instructions, masks as jumbo: ~ 0.15%

For something the size of Doom, it would save ~ 670 bytes.

Of the total constants (for Doom):
~ 75% LDI Imm16, Rn
~ 15% LDI Imm8, Rn (16-bit encoding)
~ 2% LDIHI Imm10, Rn (Load Imm10 into high-order bits)
~ 0.5% FLDCH Imm16, Rn (Load Half-Float as Double, *1)
~ 6% Jumbo64 (Imm33s)
~ 0.6% Jumbo96 (Imm64)

*1: This stat is somewhat bigger in Quake, but there are very few FP
constants in Doom.


In Quake, FLDCH is ~ 8% of the constant loads, and ~ 3% are Jumbo96
encodings (seems to be mostly Binary64 constants which can't be
expressed exactly as either Binary16 or Binary32).

The ratio of bit-mask patterns to other constants is also a little lower
in Quake vs Doom.

It looks like, for FP constants (~ 14% of the total):
~ 60% can be expressed exactly as Binary16
~ 20% can be expressed exactly as Binary32
~ 20% fall through to Binary64.

May need to look at it, a larger percentage should theoretically fit the
Binary32 pattern, given Quake is primarily using "float" operations.

Goes and looks, "Float+ImmDouble" will coerce the ImmDouble to ImmFloat,
which will forcibly round the result. The number of seemingly
double-precision constants seems mysterious.

...

BGB

unread,
Nov 23, 2021, 6:52:11 PM11/23/21
to
One thing I had considered in the past was potentially moving a lot of
things like C library headers into a VFS (so the compiler would try to
fetch header files from the VFS rather than the OS filesystem).

In such a system, every major installed library would likely be an image
file which is then mounted into the compiler's VFS (likely read-only).


I didn't do so for BGBCC, mostly because this seemed like it would be a
bit of a hassle to work with, and my existing strategies (such as
keeping previously loaded header files cached in RAM between translation
units) had mostly already gotten these issues under control.


> Terje
>

MitchAlsup

unread,
Nov 23, 2021, 8:32:13 PM11/23/21
to
So, translating the above into My 66000
<
90% IMM16
either
8% IMM32 and 2% IMM64
or
6% IMM32 and 4% IMM64
depending
on whether the 2% LDHI Imm10 goes to the HoBs of 32-bits or of 64-bits.
<
Both are within spitting distance of what I see in ASM code.
<
Now, I do have IMM12 which is used strictly for shifts.
>
> *1: This stat is somewhat bigger in Quake, but there are very few FP
> constants in Doom.
>
>
> In Quake, FLDCH is ~ 8% of the constant loads, and ~ 3% are Jumbo96
> encodings (seems to be mostly Binary64 constants which can't be
> expressed exactly as either Binary16 or Binary32).
>
> The ratio of bit-mask patterns to other constants is also a little lower
> in Quake vs Doom.
>
> It looks like, for FP constants (~ 14% of the total):
> ~ 60% can be expressed exactly as Binary16
> ~ 20% can be expressed exactly as Binary32
> ~ 20% fall through to Binary64.
<
This corroborates the VAX-11/780 FP constants.
>
> May need to look at it, a larger percentage should theoretically fit the
> Binary32 pattern, given Quake is primarily using "float" operations.
>
> Goes and looks, "Float+ImmDouble" will coerce the ImmDouble to ImmFloat,
> which will forcibly round the result. The number of seemingly
> double-precision constants seems mysterious.
<
Have you "too easily" converted FP constants to 64-bits when they could have
been smaller ?

BGB

unread,
Nov 23, 2021, 10:49:58 PM11/23/21
to
The LDHI instruction can encode either case (63:54) or (31:22) depending
on the E.Q bit, but my stat counts don't distinguish between these cases.

The ratios were fairly similar between the programs I looked at.

Main difference being that Doom and ROTT have significantly fewer
floating point constants than Quake, but this shouldn't really come as
any big surprise.


> <
> Both are within spitting distance of what I see in ASM code.
> <
> Now, I do have IMM12 which is used strictly for shifts.

I don't have an Imm12 case, and my shift instructions generally use
Imm8. This is treated as an 8-bit sign-extended quantity (positive
encodes left shift, negative encodes right shift).

The exact interpretation depends on the type size:
32 bits(SHAD /SHLD ): Treated as Mod-32
64 bits(SHADQ/SHLDQ): Treated as Mod-64
128 bits(SHADX/SHLDX): Valid range is ± 127.
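In C terms, the immediate form behaves roughly like this for the 64-bit case (a sketch of the semantics only; the 32/128-bit variants differ just in the modulo/range handling noted above):

#include <stdint.h>

int64_t shadq_imm(int64_t v, int8_t imm)   /* SHADQ: arithmetic shift */
{
    if (imm >= 0)
        return (int64_t)((uint64_t)v << (imm & 63));    /* left shift */
    return v >> ((-imm) & 63);   /* right shift; assumes arithmetic >> on signed values */
}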

There are also some SHLR/SHAR/... variants which encode a right-shift by
internally flipping the shift direction (register only). This was mostly
to avoid needing to negate the input to encode variable right shifts.


>>
>> *1: This stat is somewhat bigger in Quake, but there are very few FP
>> constants in Doom.
>>
>>
>> In Quake, FLDCH is ~ 8% of the constant loads, and ~ 3% are Jumbo96
>> encodings (seems to be mostly Binary64 constants which can't be
>> expressed exactly as either Binary16 or Binary32).
>>
>> The ratio of bit-mask patterns to other constants is also a little lower
>> in Quake vs Doom.
>>
>> It looks like, for FP constants (~ 14% of the total):
>> ~ 60% can be expressed exactly as Binary16
>> ~ 20% can be expressed exactly as Binary32
>> ~ 20% fall through to Binary64.
> <
> This corroborates the VAX-11/780 FP constants.

Could be. I had noted previously that a fair number of "typical"
constants in C code could be expressed directly as Binary16.

My ISA already had a Binary16 to Binary64 converter instruction, so I
added an encoding to feed it a constant.
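For reference, the widening that FLDCH's immediate goes through is just the standard Binary16-to-Binary64 conversion. A plain-C sketch of it (my own illustration, not the actual decoder logic):

#include <stdint.h>
#include <string.h>

double half_to_double(uint16_t h)
{
    uint64_t sign = (uint64_t)(h >> 15) << 63;
    unsigned exp  = (h >> 10) & 0x1F;
    uint64_t man  = h & 0x3FF;
    uint64_t bits;
    double   d;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                                /* +/- zero */
        } else {                                        /* subnormal half */
            unsigned k = 0;
            while (!(man & 0x400)) { man <<= 1; k++; }  /* normalize */
            man &= 0x3FF;
            bits = sign | ((uint64_t)(1009 - k) << 52) | (man << 42);
        }
    } else if (exp == 31) {
        bits = sign | (0x7FFULL << 52) | (man << 42);   /* inf / NaN */
    } else {
        bits = sign | ((uint64_t)(exp + (1023 - 15)) << 52) | (man << 42);
    }
    memcpy(&d, &bits, sizeof(d));
    return d;
}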


>>
>> May need to look at it, a larger percentage should theoretically fit the
>> Binary32 pattern, given Quake is primarily using "float" operations.
>>
>> Goes and looks, "Float+ImmDouble" will coerce the ImmDouble to ImmFloat,
>> which will forcibly round the result. The number of seemingly
>> double-precision constants seems mysterious.
> <
> Have you "too easily" converted FP constants to 64-bits when they could have
> been smaller ?

Within BGBCC, there are a few ways to store floating-point literals:
Float Literal, which stores it inline as a Binary32 value;
Double Literal, which stores it as an index into a "literal value table"
(same table is also used for Long constants, and pairs for Int128
constants).

When an expression is type Float, it coerces the immediate to use the
Float Literal format. Expressions in C tend to not use any suffixes on
floating-point constants, so the idea is the compiler will instead uses
the type from the non-constant operand for the type of the expression in
this case.


Within registers, scalar values are always operated on as Binary64
(regardless of the type of the variable).

The constant-load is thus given a value in Binary64 format, but the
compiler may use a more compact format for storage (generally by trying
to convert the value to the smaller format and then convert it back and
seeing if the values match).

If the value had been converted to Binary32 previously, then this
round-trip check should pass.


As is, there are several ways to encode a floating-point literal (scalar):
Binary16, "FLDCH Imm16, Rn" (Convert Binary16 Immed to Binary64)
Binary32, "FLDCF Imm32, Rn" (Convert Binary32 Immed to Binary64)
Truncated Binary64, "LDIQHI Imm32, Rn"
Raw Binary64, "LDI Imm64, Rn"

The issue is mostly that it appears to be using a full 64-bit load more
often than would be expected here.


Goes and looks at the 3AC dump...


OK... It seems the issue may be that it is not actually coercing the
constants to Binary32 constants in the first place; it is instead
loading them as Binary64 constants and then using type-conversion
operations to convert them to Float.

Or, IOW:
float x, y;
y=x+3.0;
Is not being handled as:
y=x+3.0f;
But, instead:
double t0;
float x, y, t1;
t0=3.0;
t1=(float)t0;
y=x+t1;
...

This is... Not quite right... I will probably need to fix this.

Terje Mathisen

unread,
Nov 24, 2021, 2:05:45 AM11/24/21
to
Also file close, i.e. anything that can trigger the virus scanner(s) to
take a look before releasing/completing the IO.

>> be filled in. Afair they reduced the installation time from several
>> minutes to ~10 seconds.
> <
> Now if MS would simply allow one to DD <current image> -> <new image>
> so that recovery was 6 minutes instead of 6 hours we might be getting
> somewhere...

That _has_ to be possible, since hibernation needs exactly that kind of
infrastructure, right?

For some probably good reasons, they don't want to expose the kernel
interfaces which allows this to work?

Marcus

unread,
Nov 24, 2021, 8:33:52 AM11/24/21
to
Thanks! That's a very interesting talk (I'm halfway through and got to
the non-blocking CloseHandle() part). I will most likely investigate
some of those tricks for BuildCache on Windows.

/Marcus

BGB

unread,
Nov 25, 2021, 3:00:24 AM11/25/21
to
Fixing the bug mentioned previously seems to address this issue...

After fixing the issue, there are somewhat more Binary32 loads than
Binary64 loads, with ~ 2% of the constant loads being Binary64.

>
> Goes and looks at the 3AC dump...
>
>
> OK... It seems the issue may be that it is not actually coercing the
> constants to Binary32 constants in the first place; it is instead
> loading them as Binary64 constants and then using type-conversion
> operations to convert them to Float.
>
> Or, IOW:
>   float x, y;
>   y=x+3.0;
> Is not being handled as:
>   y=x+3.0f;
> But, instead:
>   double t0;
>   float x, y, t1;
>   t0=3.0;
>   t1=(float)t0;
>   y=x+t1;
> ...
>
> This is... Not quite right... I will probably need to fix this.
>

Status update:
Fixing this issue, and also discovering and fixing another issue which
was resulting in incorrect type conversions (it was coercing constant
values to a smaller type in some cases where it should have
type-promoted instead), fixed some of the bugs I was encountering in Quake.

There are still a lot of cases where expressions are promoting to
'double' only to then be forced back to 'float'. Not really a quick/easy
fix though (and in these cases, the promotions are following C's
type-promotion rules; it is more a case of it being slightly less
efficient than it could be).


Still no effect on the results of the E3M5 desync though.

Did start wondering though if there is a way to make the other demos not
desync, but looking into it, it appears every version of the Doom engine
had slightly different behavior, and it appears some of the
newer/fancier multi-game engines try to infer the Doom engine version
from the IWAD and change around a bunch of settings to try to match the
behavior of the particular engine for the particular IWAD (eg, minor
changes from one version of the Doom engine breaking demos originally
recorded on another version of the Doom engine).

Or, otherwise, it is likely getting "correct" demo playback from an
ad-hoc port of the Linuxdoom source might be asking a bit much...


Heretic and Hexen still behave as before.

Though, as noted:
Heretic seems to remain in-sync for the Shareware IWAD;
It desyncs for two of the demos in the release IWAD.

This is the same between builds, though there does appear to be a
difference in that in one of the demos, in some cases an enemy drops a
"morph ovum" and in other cases they do not (implying a there is some
non-deterministic property somewhere).


Previously, in a case where demo playback was diverging between targets
in Doom, it was due to an out-of-bounds memory access and an apparent
divergence in the contents of the out-of-bound memory contents between
targets (possibly a pointer within the Z_Malloc headers or similar).

It is possible I could be looking at something similar in this case,
rather than necessarily a difference in compiler behavior.


No visible change in ROTT behavior, though it is nowhere near close to
matching the behavior of the x86 build (in that an x86 build with MSVC
does actually manage to play the demos correctly, but builds for other
targets are prone to desync).

Though, if anything, a lot of this is evidence for why it is probably
better, when implementing a game engine, to implement demos in terms of
recording game events and similar, rather than by recording raw user
keypresses.

...

Marcus

unread,
Nov 25, 2021, 3:42:19 AM11/25/21
to
On 2021-11-25 09:00, BGB wrote:

[snip]
You seem to be using several of the old C code classics (Doom, Quake,
ROTT, Heretic, ...?). Have you seen
https://github.com/videogamepreservation ? A gold mine of old games (it
even has both C and Fortran versions of Zork!).

I wonder, which code bases have been most useful to work with?

I have ported Doom and Quake (obvious candidates, and I have ported them
to other architectures before so it was a fairly low barrier). I
personally use Quake a lot since it is a fairly large C code base with
near zero OS/platform dependencies, and it also has floating-point.
It's far from perfect, though (it's showing its age).

/Marcus

Terje Mathisen

unread,
Nov 25, 2021, 4:37:21 AM11/25/21
to
Marcus wrote:
> I have ported Doom and Quake (obvious candidates, and I have ported them
> to other architectures before so it was a fairly low barrier). I
> personally use Quake a lot since it is a fairly large C code base with
> near zero OS/platform dependencies, and it also has floating-point.
> It's far from perfect, though (it's showing its age).

Quake showing its age? Please tell me it ain't so!

It was only 25+ years ago that I tried to grok the original C and Mike
Abrash' asm version, so that I could look for possible speedups. :-)

Marcus

unread,
Nov 25, 2021, 6:21:47 AM11/25/21
to
On 2021-11-25 10:37, Terje Mathisen wrote:
> Marcus wrote:
>> I have ported Doom and Quake (obvious candidates, and I have ported them
>> to other architectures before so it was a fairly low barrier). I
>> personally use Quake a lot since it is a fairly large C code base with
>> near zero OS/platform dependencies, and it also has floating-point.
>> It's far from perfect, though (it's showing its age).
>
> Quake showing its age? Please tell me it ain't so!

Yeah :-) The things that I wouldn't expect to see in a more modern code
base are its excessive use of global variables (both for passing
information between functions, and just because they didn't bother with
using the static keyword), and its excessive use of shorts and bytes
(some of it is motivated, e.g. in packed data structures, but some of it
just seems to be because "shorts are faster & better than ints!").

These things may have been good/acceptable on x86-32 where you didn't
have any GPR:s anyway (and the ones you had could be used as 8-bit or
16-bit registers), but it's really not a good design for modern
load/store machines (nor a good SW architecture).

>
> It was only 25+ years ago that I tried to grok the original C and Mike
> Abrash' asm version, so that I could look for possible speedups. :-)
>

I think I dabbled with it just before the source was made public (found
it on some dodgy FTP server), and ported it to DEC Alpha (porting to a
64-bit arch was "fun" back then).

/Marcus

Terje Mathisen

unread,
Nov 25, 2021, 9:23:28 AM11/25/21
to
Marcus wrote:
> On 2021-11-25 10:37, Terje Mathisen wrote:
>> Marcus wrote:
>>> I have ported Doom and Quake (obvious candidates, and I have ported them
>>> to other architectures before so it was a fairly low barrier). I
>>> personally use Quake a lot since it is a fairly large C code base with
>>> near zero OS/platform dependencies, and it also has floating-point.
>>> It's far from perfect, though (it's showing its age).
>>
>> Quake showing its age? Please tell me it ain't so!
>
> Yeah :-) The things that I wouldn't expect to see in a more modern code
> base is its excessive use of global variables (both for passing
> information between functions, and just because they didn't bother with
> using the static keyword), and it's excessive use of shorts and bytes
> (some of it is motivated, e.g. in packed data structures, but some of it
> just seems to be because "shorts are faster & better than ints!").

I didn't look at those parts, had enough to do with helping Mike with
the asm code which was already 2.5x faster than the original C.
>
> These things may have been good/acceptable on x86-32 where you didn't
> have any GPR:s anyway (and the ones you had could be used as 8-bit or
> 16-bit registers), but it's really not a good design for modern
> load/store machines (nor a good SW architecture).

You've never really seen register pressure if you haven't tried to
optimize something like affine texture mapping on a 486 (or earlier)
running in 16-bit mode. I wrote lots of code in those days that
absolutely had to use both the low and high half of AX/BX/CX or DX as
separate 8-bit registers.

Said affine mapping could take advantage of 32-bit regs with 16:16
fixed-point math, that was the fastest way to do it until I discovered
how to cheat. :-)

>
>>
>> It was only 25+ years ago that I tried to grok the original C and Mike
>> Abrash' asm version, so that I could look for possible speedups. :-)
>>
>
> I think I dabbled with it just before the source was made public (found
> it on some dodgy FTP server), and ported it to DEC Alpha (porting to a
> 64-bit arch was "fun" back then).

I believe you!

BGB

unread,
Nov 25, 2021, 1:50:00 PM11/25/21
to
Yeah.

Ports, actively used (mostly based on the original source releases):
Quake (both software and GLQuake)
Doom
Heretic
Hexen
ROTT

Partial (not uploaded as of yet):
Quake 3 Arena (incomplete, OS support, *1)
Wolfenstein 3D (incomplete, license issues, ...)

*1: To work effectively, Quake 3 will require both virtual memory and
DLL loading. Neither of these really "has the kinks worked out". It is
also very likely to have considerably worse performance than Quake 1 (so
most likely, basically a slide-show).

A lot of work also went into trying to reduce Quake 3's memory footprint
to something more reasonable for running on an FPGA, partly by allowing
the ZIP-based PK3 format to be replaced with my own 'WAD4' format, which
needs less RAM and is faster, though RP2 gets worse compression
than Deflate. Another change was allowing the JPG and TGA textures to be
replaced with DDS textures (DXT1 or DXT5).

In vertex-lighting mode, Quake 3 seems to draw less geometry than Quake
1 or 2, but one concern is that it has somewhat larger and more
complicated BSPs, which were already a big source of slowdown in Quake 1
(GLQuake was still limited to single-digit frame-rates even if one
doesn't draw any geometry).

Though, it does appear if one could redo the Quake 1 maps with Quake 3's
BSP format and tools (say, qbsp3 modified to accept Quake 1 map files),
it is possible it could be faster than the original Quake 1 BSPs.


Some small experiments:
Wrote a video player (several codecs, AVI format, *2);
Wrote a miniature raycast Minecraft-style engine;
Wrote a Mod player (still unreleased IIRC, *3)


*2:
Video:
RPZA: "QuickTime Video"
CRAM: "MS Video 1"
BT4B: one of my own codecs
BT5A: Another of my own codecs, simple indexed-color color-cell.
It is similar in concept to CRAM, but gets a lower bitrate.
Faster / simpler to decode than BT4B.
Audio:
IMA ADPCM
BTAC1C (extended form of IMA ADPCM)
Decoder is backwards compatible with IMA ADPCM.


No MPEG style codecs; I doubt BJX2 (at 50MHz) has the go power to make
MPEG style codecs "not suck".

While RPZA and CRAM are simple to decode, they ran into the problem that
their bitrates are bad enough to run into the limits of the IO bandwidth
from the SDcard (if running it at 5MHz, or ~ 600kB/s). One has to either
use very low resolutions or very poor quality settings, which did not
look good (I wanted, say, 320x200 video, not 160x100).

My BT4B codec did OK; was able to push a color-cell based codec into
being bitrate competitive with MPEG-1, and it decoded very fast by PC
standards, but it was entropy coded and still fairly computationally
demanding to decode on BJX2 (as well as being a fairly complicated
format in general).


So, I came up with BT5A as a compromise, intended to reduce the bitrate
while keeping decoding complexity more in line with something like RPZA
or CRAM, and using a dynamically modified index-color palette rather
than RGB555, ...

Structurally, it resembled a hybrid of CRAM, RPZA, and RP2. Like these
formats, it was also built around a raw byte stream (no entropy coding);
with a baseline color-cell format of 4x4x1 pixels (16 bits) and two
8-bit index-color endpoints.


*3: My initial attempt at MOD playback had performance issues with the
sound mixing. In this case, it is doing the mixing in software (no
dedicated hardware mixer).

Some of the MODs tested had sample data too large to fit effectively
into the L2 cache; and they were using mostly premixed PCM audio and
effectively using the MOD format like a chunked WAV file, which seems to
sorta defeat the point of the MOD format.

A lot of the other examples I could find were either "kinda awful", or
were built using more complicated formats (S3M/XM/IT/...).


> Have you seen
> https://github.com/videogamepreservation ? A gold mine of old games (it
> even has both C and Fortran versions of Zork!).
>

Not really dug around on there.

I mostly have games which I was aware of the source for, and for which
the porting effort didn't look too absurd.

The games mostly tend to provide quick visual feedback if something has
broken somewhere (and tend to have better coverage than what I have been
able to pull off with "sanity tests").


> I wonder, which code bases have been most useful to work with?
>

In terms of testing C compiler features, Doom and Quake work well.

Hexen and Heretic are basically Doom variants, and my ports of these
engines mostly reused code which was copy/pasted from my Doom port. The
main difference (at the low level) was that Hexen's sound system was
somewhat redesigned from Doom, so required a little more work.


The ROTT port involved a whole lot of stripping out and rewriting a
bunch of x86 specific stuff (into C), and sorta awkwardly emulating the
VGA hardware via function calls because otherwise it would have required
a more significant rewrite of the renderer.

The engine was also very prone to out-of-bounds memory accesses.

Codebase is basically sorta like Wolf3D + parts of Doom (but, many of
these parts used poorly, like they seemingly didn't quite get the point
of the WAD format and implemented a whole bunch of stuff with computed
lump indices).

Some parts of my porting effort also involved rewriting parts of the
engine to access lumps by name; allowing the engine to be modded Doom
style by loading PWADs. For sake of my "Wolf3D in the ROTT engine"
effort, had also allowed the maps to be put into the WAD file (like in
Doom), and allowed the RLE compression to be replaced with LZ
compression, ...


It also has the odd bit of code still using K&R declaration syntax and
similar. (Though this does raise the question of whether everyone
adopted the C89 syntax fairly quickly, or whether most of the existing
compilers and programmers had already migrated to the C89 syntax before
it got standardized.)


> I have ported Doom and Quake (obvious candidates, and I have ported them
> to other architectures before so it was a fairly low barrier). I
> personally use Quake a lot since it is a fairly large C code base with
> near zero OS/platform dependencies, and it also has floating-point.
> It's far from perfect, though (it's showing its age).
>

Yeah, it works well.

Harder is getting it to perform acceptably at 50MHz.
It looks like it would have at least been playable at 100MHz.

Current ports I have are:

The software renderer, mostly the C version, with a few parts rewritten
in ASM, and modified slightly to replace the use of 8-bit indexed color
with 16-bit hi-color. In my early tests, switching to hicolor was faster
than drawing as indexed color and then converting the image to hi-color
when copying the image to video RAM (despite the likely increase in
cache miss rate); sorta similar was done with my Doom port.


GLQuake, with a custom-written Software GL.

It sorta goes back and forth which was faster. My Software GL was faster
for a while, but with some recent compiler work, the software renderer
regained some of the lost ground.

Ironically, in this case, the software renderer looks a little nicer
than my software GL implementation. A lot of corners were cut in the
name of trying to make it go acceptably fast.

As can be noted, this implementation:
More or less implements the OpenGL 1.x feature-set;
Does so enough to run Quake 3 at least;
Supports compressed textures, uploaded as DXT1 or DXT5 but transcoded
into an internal format for which a hardware decoder exists;
Mostly limited to square power-of-2 texture-sizes, but these are pretty
much standard in OpenGL.



Both could probably be "better" if I had faster FPU ops.
I had been tempted to add some logic for faster but lower precision FPU
and SIMD ops (likely focused on Binary32).

Specifics differ; I don't want to spend encoding space on it, so the
leading option would be to encode it via rounding modes, and/or setting
up a "go fast, I don't care as much about precision" mode (which then
shaves a few cycles off the operations).

This is partly because, after optimizing a lot of other stuff, FADD and
FMUL have increased somewhat in the ranking of "instructions eating lots
of clock cycles" to be more worth considering.

One "likely extra fast" option would be to limit the mantissa to 18
bits, which would allow using a single DSP for the FMUL (and possibly
allow for fully pipelined FPU ops in this case); but some of my testing
implies that this is likely insufficient for Quake.

If one drops "float" calculations below Binary32 precision (by even only
a few bits), then Quake starts glitching out, with stuff in the
"progs.dat" VM glitching out, geometry issues, the player slowly
spinning in a circle, ...


But, short of dropping down this far, I have doubts it could be made
fast enough to fit into the main pipeline, which is basically what would
be required to pass the "worthwhile" threshold (vs the current 6 cycle
Binary64 ops).

More so, it still wouldn't be enough to make Quake all that playable
(but could probably help at least).

Granted, if I could make FPU and FPU SIMD ops a bit faster, this would
also help with my software OpenGL rasterizer (which puts lots of clock
cycles into FPU and SIMD ops).

...


> /Marcus

James Van Buskirk

unread,
Dec 4, 2021, 10:12:06 PM12/4/21
to
"Marcus" wrote in message news:snnrk8$dgl$1...@dont-email.me...

> On 2021-11-25 10:37, Terje Mathisen wrote:

> > Quake showing its age? Please tell me it ain't so!

> I think I dabbled with it just before the source was made public (found
> it on some dodgy FTP server), and ported it to DEC Alpha (porting to a
> 64-bit arch was "fun" back then).

Are you the one responsible for that port? It always crashed when you
fired the BFG9000 :(

Marcus

unread,
Dec 8, 2021, 4:29:32 PM12/8/21
to
I doubt that the port ever left the Swedish university that I studied at
back then, or even the few accounts that had the copy (but what do I
know?).

/Marcus

Ir. Hj. Othman bin Hj. Ahmad

unread,
Dec 16, 2021, 4:55:12 PM12/16/21
to
On Monday, 22 November 2021 at 01:13:13 UTC+8, Marcus wrote:
> Just wrote this short post. Maybe someone finds it interesting...
>
> https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
>
> /Marcus

I do not regularly see assembly output of compilers but I suspect something like this should happen.
Thank you for sharing. This is an extreme example.

HLL is very poor in bit manipulations. Better use a library for these bit manipulations, so less need for optimization.

I used to haunt comp.arch 30 years ago, in the 1990s. Now, I do not know that I need to subscribe to this group.
I did use a different email. But more than 10 years ago, I used this email to post to comp.arch also. I was able to see my post from 30 years ago. Not sure now.

MitchAlsup

unread,
Dec 16, 2021, 6:22:31 PM12/16/21
to
On Thursday, December 16, 2021 at 3:55:12 PM UTC-6, Ir. Hj. Othman bin Hj. Ahmad wrote:
> On Monday, 22 November 2021 at 01:13:13 UTC+8, Marcus wrote:
> > Just wrote this short post. Maybe someone finds it interesting...
> >
> > https://www.bitsnbites.eu/i-want-to-show-a-thing-cpp-code-generation/
> >
> > /Marcus
>
> I do not regularly see assembly output of compilers but I suspect something like this should happen.
> Thank you for sharing. This is an extreme example.
>
> HLL is very poor in bit manipulations.
<
HDLs are very good in bit manipulation::
Verilog: variable[first:last] just wires the bits as is.
Verilog: variable[last:first] wires the bits up backwards
<
> Better use a library for these bit manipulations, so less need for optimization.
<
Bit manipulation (for fields less than 64-bits in size) are available natively in the My 66000 ISA by
<
# define fieldspec(a,b) (((a)<<32)|(b))
<
extracted = container >> fieldspec(width,offset);
//
// The HW checks :: 1 <= bits[37..32] <= 64 and 0 <= bits[5..0] <= 63
// The HW allows bits[37..32]==0 as a representation of 64-bits.
//
No need for any kind of library; << and >> can directly access bit fields.

Thomas Koenig

unread,
Dec 17, 2021, 2:28:30 AM12/17/21
to
Ir. Hj. Othman bin Hj. Ahmad <oth...@gmail.com> schrieb:

> HLL is very poor in bit manipulations. Better use a library for
> these bit manipulations, so less need for optimization.

Not really - calling a library function has a _lot_ of overhead,
and what goes on in the library is hidden from the compiler. It is
usually preferred for the compiler to emit highly optimized code.

These days, compilers might even recognize some standard
idioms for certain bit manipulations and use a corresponding
machine instruction, if available.

chris

unread,
Dec 17, 2021, 6:28:38 AM12/17/21
to
Working primarily in embedded space, would never use bitfields in
C, as not guaranteed to be portable. Tend to use inline mask tables
for that functionality and stick to a safe subset of C.

Chris



David Brown

unread,
Dec 17, 2021, 8:42:18 AM12/17/21
to
Working primarily in embedded development, I quite happily use bitfields
because you generally know exactly what hardware you are working with,
and bitfields can significantly reduce some kinds of error.

Not being guaranteed to be portable is rarely as big an issue as people
often think. If you are writing a driver for a peripheral in a
microcontroller, portability is already limited by the specifics of the
hardware.

There are a few things about bitfields that are not tightly specified.
Ordering of bits, padding and alignment, and the types of bitfields
supported are the usual ones. But a lot of that is either easily
checked (such as a static assertion on the size of the struct),
consistent for the toolchain (gcc supports the same types across all
targets), or consistent for the architecture (most processors have an
ABI which specifies bitfield details).
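
As a rough sketch (the register name and field layout here are invented,
just to illustrate the static-assertion point):

#include <assert.h>

/* Hypothetical status register, low bit first as in a typical
   little-endian ABI. */
typedef struct {
    unsigned rx_ready   : 1;
    unsigned tx_empty   : 1;
    unsigned parity_err : 1;
    unsigned reserved   : 25;
    unsigned fifo_level : 4;
} uart_status_t;

/* Catch a toolchain that packs or pads this differently. */
static_assert(sizeof(uart_status_t) == 4, "unexpected bitfield layout");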

The key place to avoid (or be very careful with) bitfields due to
portability issues is when you have data exchange back and forth from
the system you are using - network protocols, file formats, and the like.


David Brown

unread,
Dec 17, 2021, 8:51:38 AM12/17/21
to
On 17/12/2021 08:28, Thomas Koenig wrote:
> Ir. Hj. Othman bin Hj. Ahmad <oth...@gmail.com> schrieb:
>
>> HLL is very poor in bit manipulations. Better use a library for
>> these bit manipulations, so less need for optimization.
>
> Not really - calling a library function has a _lot_ of overhead,
> and what goes on in the library is hidden from the compiler. It is
> usually preferred for the compiler to emit highly optimized code.
>

The best choice here is a header library (or equivalent for languages
other than C, C++ and friends). You can write your static inline
functions or templates in whatever you feel is the nicest way for your
needs, and re-use these functions with no overhead.
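
A minimal sketch of what such a header function might look like (the
names are invented; len is assumed to be between 1 and 31 here):

#include <stdint.h>

/* Extract 'len' bits starting at bit 'pos' (bit 0 = least significant). */
static inline uint32_t bits_extract(uint32_t v, unsigned pos, unsigned len)
{
    return (v >> pos) & ((1u << len) - 1u);
}

/* Replace those bits with the low bits of 'x'. */
static inline uint32_t bits_insert(uint32_t v, unsigned pos, unsigned len,
                                   uint32_t x)
{
    uint32_t mask = ((1u << len) - 1u) << pos;
    return (v & ~mask) | ((x << pos) & mask);
}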

> These days, compilers might even recognize some standard
> idioms for certain bit manipulations and use a corresponding
> machine instruction, if available.
>

Compilers have been doing that for decades. Details vary, of course,
and sometimes they are not quite optimal depending on exactly how you
write the bit manipulation. But they are usually pretty solid.

chris

unread,
Dec 17, 2021, 11:52:58 AM12/17/21
to
To each his own, but take a slightly different approach. Have a
standard header file included at the top of all modules, with
all the bit definitions. Within the modules, short
const tables to use as masks. The whole idea is to abstract away
anything that could be affected by the compiler. Done right, the
code can be expected to compile with any modern compiler. All
code here is expected to be reused, so some effort is made to
help that along. If that means using a subset of C, then
fair enough.

Bitfields may have been a novel idea with code storage limitations
of the past but just one of those things in C that seem just a
bit dodgy. So many horror stories about bitfields with early
compilers, so best avoided. The early MISRA standard thought so
as well. Program defensively seems like common sense to me
and far more productive in the long run...

Chris

David Brown

unread,
Dec 17, 2021, 12:27:05 PM12/17/21
to
Of course - there is no "right" answer here!

> Have a
> standard header file included at the top of all modules, with
> all the bit definitions.  Within the modules, short
> const tables to use as masks. The whole idea is to abstract away
> anything that could be affected by the compiler. Done right, the
> code can be expected to compile with any modern compiler.

Done right and used appropriately, bitfields are perfectly suitable for
use in portable code and any modern compiler. That doesn't mean they
are necessarily /better/ for any particular use, just that they are not
something that you have to pass on if you want safe, reliable, efficient
and portable code.

> All
> code here is expected to be reused, so some effort is made to
> help that along. If that means using a subset of C, then
> fair enough.

Everyone writes in a subset of C - we just have different choices as to
what those subsets should be :-)

>
> Bitfields may have been a novel idea with code storage limitations
> of the past but just one of those things in C that seem just a
> bit dodgy. So many horror stories about bitfields with early
> compilers, so best avoided. 

No - the answer is to avoid horrible early compilers. It is silly to
restrict how you can work purely because of outdated problems.

> The early MISRA standard thought so
> as well. Program defensively seems like common sense to me
> and far more productive in the long run...
>

MISRA has always seemed to me to be a mixture of a few good rules, a
number of directly bad and counter-productive rules, and a whole lot of
things that could be summed up by "Learn the C language and use it
correctly" and "Don't be stupid". There is a difference between
programming defensively and limiting good programming techniques because
of bad tools. It is far more productive to use good tools.

chris

unread,
Dec 17, 2021, 12:54:13 PM12/17/21
to
Oh, agree with much of that, but we are all conditioned by
experience and I think the bitfield thing was so ingrained and
have used other methods for so long, unlikely to change now.

Right about some of the early compilers as well. We have a lot to be
thankful for with the gnu tools, which put to rest for good, expensive
tool chains and license manager dongles...

Chris

Terje Mathisen

unread,
Dec 17, 2021, 3:42:25 PM12/17/21
to
chris wrote:
> On 12/17/21 13:42, David Brown wrote:
>> The key place to avoid (or be very careful with) bitfields due to
>> portability issues is when you have data exchange back and forth from
>> the system you are using - network protocols, file formats, and the like.
>>
>
> o each his own, but take a slightly different approach. Have a
> standard header file included at the top of all modules, with
> all the bit definitions.  Within the modules, short
> const tables to use as masks. The whole idea is to abstract away
> anything that could be affected by the compiler. Done right, the
> code can be expected to compile with any modern compiler. All
> code here is expected to be reused, so some effort is made to
> help that along. If that means using a subset of C, then
> fair enough.

I don't use that header setup, I'm closer to David here:

I'm guessing that I have written more low-level bit-twiddling code than
most here (graphics, io drivers, codecs, crypto, network protocols, fp
emulation), and I have _never_ used bitfields to implement any of this.

I'll typically load 32 or 64 bits surrounding the target area, swap the
bytes if there's an endian change, then use regular shift & mask ops to
get at what I need.
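
Something along these lines, as a sketch (assuming gcc/clang where
__builtin_bswap32 exists, with a hypothetical NEED_BSWAP macro standing
in for the endian check, and len < 32):

#include <stdint.h>
#include <string.h>

static inline uint32_t load_field(const unsigned char *p, unsigned pos,
                                  unsigned len)
{
    uint32_t w;
    memcpy(&w, p, 4);            /* load 32 bits surrounding the target area */
#ifdef NEED_BSWAP
    w = __builtin_bswap32(w);    /* swap bytes if there's an endian change */
#endif
    return (w >> pos) & ((1u << len) - 1u);  /* regular shift & mask */
}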

MitchAlsup

unread,
Dec 17, 2021, 5:48:40 PM12/17/21
to
We teach compilers to recognize << followed by >> and convert it into
EXT instructions.
We teach compilers to recognize & followed by << followed by &~
followed by | and convert it into INS instructions (when appropriate).
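
In C source, the idioms being matched look roughly like this (a sketch;
the function names are invented, with 0 < len < 32 and pos+len <= 32
assumed):

#include <stdint.h>

/* << followed by >>: extract a field, a candidate for a single EXT. */
static inline uint32_t ext_u32(uint32_t word, unsigned pos, unsigned len)
{
    return (word << (32u - pos - len)) >> (32u - len);
}

/* & then << then &~ then |: insert a field, a candidate for a single INS. */
static inline uint32_t ins_u32(uint32_t word, unsigned pos, unsigned len,
                               uint32_t val)
{
    uint32_t mask = (1u << len) - 1u;
    return (word & ~(mask << pos)) | ((val & mask) << pos);
}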

MitchAlsup

unread,
Dec 17, 2021, 9:54:43 PM12/17/21
to
Then there was other bit twiddling codes that generally read something
like:
<
if( p->state & SOMESTATE )
{
p->state ^= SOMESTATE | SOMEOTHERSTATE;
.....
}
<
Using XORs to get rid of the current state and change it to some new state
while leaving the rest of the container alone. This is often faster than actually
using bit fields.

BGB

unread,
Dec 17, 2021, 11:00:19 PM12/17/21
to
On 12/17/2021 7:51 AM, David Brown wrote:
> On 17/12/2021 08:28, Thomas Koenig wrote:
>> Ir. Hj. Othman bin Hj. Ahmad <oth...@gmail.com> schrieb:
>>
>>> HLL is very poor in bit manipulations. Better use a library for
>>> these bit manipulations, so less need for optimization.
>>
>> Not really - calling a library function has a _lot_ of overhead,
>> and what goes on in the library is hidden from the compiler. It is
>> usually preferred for the compiler to emit highly optimized code.
>>
>
> The best choice here is a header library (or equivalent for languages
> other than C, C++ and friends). You can write your static inline
> functions or templates in whatever you feel is the nicest way for your
> needs, and re-use these functions with no overhead.
>

It depends some.

One might end up also having a header which serves mostly to detect some
compiler and machine specific defines and tune the logic for performance.

Say, for GCC, one might want to use inline functions, but for MSVC it
might be better to do everything with preprocessor macros, ...
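
A sketch of the sort of detection header meant here (the macro names are
made up):

/* bit_config.h -- pick an implementation style per compiler. */
#if defined(__clang__) || defined(__GNUC__)
  #define BIT_INLINE static inline __attribute__((always_inline))
  #define BIT_USE_INLINE_FUNCS 1
#elif defined(_MSC_VER)
  #define BIT_INLINE static __forceinline
  #define BIT_USE_INLINE_FUNCS 0   /* fall back to macros on this compiler */
#else
  #define BIT_INLINE static
  #define BIT_USE_INLINE_FUNCS 0
#endif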


>> These days, compilers might even recognize some standard
>> idioms for certain bit manipulations and use a corresponding
>> machine instruction, if available.
>>
>
> Compilers have been doing that for decades. Details vary, of course,
> and sometimes they are not quite optimal depending on exactly how you
> write the bit manipulation. But they are usually pretty solid.

Recently was messing with an LZ decoder, and comparing performance
results between several ways of moving data/values around:
using volatile and pointer casts;
using may_alias and pointer casts;
using memcpy;
using byte-oriented access.

This was via a wrapper library for various "fundamental" operators, such
as getting/setting values, or copying 8 or 16 bytes from a source
location to a destination (may overlap).


On GCC and Clang:
Volatile was ~ 20% faster than memcpy (~ 2.0GB/s vs 1.8GB/s);
Could not get may_alias to work correctly on GCC;
On clang, may_alias gave the same behavior as volatile;
Byte-oriented patterns were slower than memcpy (~ 1.5 GB/s).

On MSVC:
Volatile was fastest (~ 2.2 GB/s)
No obvious difference between volatile and a bare pointer.
Memcpy was again, slightly slower ( ~ 2.0 GB/s);
Byte-oriented access was somewhat worse ( ~ 800 MB/s ).

Bare pointer casts, with no 'volatile' or similar, in both GCC and
Clang, caused corrupted output data.

Interestingly, in GCC, "-fno-strict-aliasing" did not seem to have any
visible effect in this case, however optimization level did (-O1 and -Os
worked, -O2 and -O3 failed). It appears as if GCC was changing the
relative order of the memory loads and stores unless 'volatile' were used.


This was with my RP2 format, which (on x86) seems to be a little slower
than LZ4. The LZ4 command-line tool (written by Yann Collet) gave ~ 2.7
GB/s for the file I was testing (an arbitrary EXE file); though, for my
own LZ4 decoders, it tends to be a little closer.


I was also messing around with trying to get more speed from a 'ULZ'
codec, which was sort of like a hybrid of LZ4 and Deflate.

It uses Huffman coding with a similar structure for the encoded stream
to that of LZ4 (alternating runs of literal bytes and LZ matches).
Things like match lengths and distances use a similar encoding to the
distance encoding used in Deflate.

Unlike Deflate, Huffman symbols were limited to 12 bits (*).

*: With a 12-bit lookup, the lookup table is ~ 8K, and allows for the
fastest-case Huffman decoder to be a single table lookup. A 15-bit limit
(as in Deflate) gives slightly better compression, but makes the lookup
table considerably slower (the faster approach then being to use a
2-stage lookup). However, making the symbol length shorter than 12-bits
has a significant adverse effect on its general effectiveness.
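
As a sketch, the fast-path decode step implied here (matching the table
layout used further below: 4096 16-bit entries, symbol in the low 8 bits
and code length above it; 'win' is assumed to hold at least 12 valid bits
above 'pos'):

#include <stdint.h>

static inline unsigned huff_decode1(const uint16_t *htab, uint64_t win,
                                    unsigned *pos)
{
    unsigned e = htab[(win >> *pos) & 4095];  /* single table lookup */
    *pos += e >> 8;                           /* consume code-length bits */
    return e & 255;                           /* decoded symbol */
}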

With ULZ, was getting a decode speed of around 460 MB/s for the file in
the previous tests (though, working on making it faster did make the
code somewhat larger; a minimal decoder being around 300 lines, but the
"fast" decoder was more around 800 lines; most of this related to
bulkier code for the bitstream handling and symbol decoding).

In my tests, ULZ compression was in a similar area to Deflate.

The current ULZ design uses 3 Huffman tables, but I could consider
experimenting with a variant which uses 2 Huffman tables (could fit in
L1 a little easier).

...


These values were for my native PC, decoding speeds on BJX2 are
considerably slower (~ 7 MB/s at @ 50 MHz). None the less, it is still
fast enough to be usable for IO acceleration from the SDcard, though
doesn't usually offer a significant compression advantage over RP2
(which is somewhat faster).




Not much new ISA-wise; I suspect BJX2 is starting to reach a certain
degree of stability. Most recent change was adding a TEAH register,
which is mostly useful for dealing with XTLB misses (allowing the TLB to
send over the entire 96-bit address).

It could potentially be useful for extending the capabilities of
inter-processor interrupts (including significantly expanding the
encoding space for the CoreID for inter-core interrupts).


I am also having idle thoughts as to whether or not to resume work on an
x86 emulator for BJX2 (running x86 on BJX2 via a threaded-code VM / JIT).


Other thoughts are for whether or not to continue working on rewriting
the C library. Moving forward (eg: giving TestKern capabilities more
like a real OS) is likely to require some level of structural redesign
both to the C library and to TestKern as a whole. This would involve
some amount of "driving wedges", namely separating "front-end" and
"back-end" parts of the C library to be able to play nicer with things
like dynamic linking, as well as dividing the C library and TestKern
kernel into two independent entities (eg: not linking a copy of the
kernel with every binary).

Though, loading up binaries bare-metal is nice for debugging, I am
likely to need to move away from this practice.

In the re-imagined C library, a lot of the library will remain
statically linked with the binaries, however the library would be
divided into two major halves connected via internal VTables, with
another VTable providing OS level interfaces. When a DLL is loaded, it
will inherit its back-end VTables from the main binaries instance of the
C library. The OS VTable would mostly consist of syscall wrappers (would
partly replace the current use of wrapper functions and function pointers).

The change would be drastic enough that they couldn't be easily glued
onto the existing C library (a modified version of PDPCLIB), and for a
while I had been tempted to rewrite the C library anyways mostly to
reduce the amount of "cruft".


...

David Brown

unread,
Dec 18, 2021, 12:26:38 PM12/18/21
to
On 18/12/2021 05:00, BGB wrote:
> On 12/17/2021 7:51 AM, David Brown wrote:
>> On 17/12/2021 08:28, Thomas Koenig wrote:
>>> Ir. Hj. Othman bin Hj. Ahmad <oth...@gmail.com> schrieb:
>>>
>>>> HLL is very poor in bit manipulations. Better use a library for
>>>> these bit manipulations, so less need for optimization.
>>>
>>> Not really - calling a library function has a _lot_ of overhead,
>>> and what goes on in the library is hidden from the compiler.  It is
>>> usually preferred for the compiler to emit highly optimized code.
>>>
>>
>> The best choice here is a header library (or equivalent for languages
>> other than C, C++ and friends).  You can write your static inline
>> functions or templates in whatever you feel is the nicest way for your
>> needs, and re-use these functions with no overhead.
>>
>
> It depends some.
>
> One might end up also having a header which serves mostly to detect some
> compiler and machine specific defines and tune the logic for performance.
>
> Say, for GCC, one might want to use inline functions, but for MSVC it
> might be better to do everything with preprocessor macros, ...

Of course. If you expect to be using the header for multiple different
compilers, architectures, standards, optimisations, etc., then it is not
at all unreasonable to have some compiler detection and give specialised
versions for certain combinations and generic fall-backs for others.

It would be unexpected for a compiler to support inline functions but
not be able to handle them as efficiently as a macro. But some
compilers will handle particular ways of expressing the code more
efficiently, and in some cases you might have processor-specific
intrinsics, compiler extensions, or even inline assembly to get the
optimal code.
Compilers are free to change load and store orders on the assumption
that the current thread of execution is the only thing that ever sees
the accesses, unless you specifically say otherwise (with "volatile",
atomics, or other methods). More extreme movements are usually only
done at higher levels of optimisation. But if the code does not work
correctly at -O2, you can be pretty confident that your code is wrong.
Compiler bugs are not unheard of, of course, but user-code bugs are a
/lot/ more common!

BGB

unread,
Dec 18, 2021, 3:00:03 PM12/18/21
to
On 12/18/2021 11:26 AM, David Brown wrote:
> On 18/12/2021 05:00, BGB wrote:
>> On 12/17/2021 7:51 AM, David Brown wrote:
>>> On 17/12/2021 08:28, Thomas Koenig wrote:
>>>> Ir. Hj. Othman bin Hj. Ahmad <oth...@gmail.com> schrieb:
>>>>
>>>>> HLL is very poor in bit manipulations. Better use a library for
>>>>> these bit manipulations, so less need for optimization.
>>>>
>>>> Not really - calling a library function has a _lot_ of overhead,
>>>> and what goes on in the library is hidden from the compiler.  It is
>>>> usually preferred for the compiler to emit highly optimized code.
>>>>
>>>
>>> The best choice here is a header library (or equivalent for languages
>>> other than C, C++ and friends).  You can write your static inline
>>> functions or templates in whatever you feel is the nicest way for your
>>> needs, and re-use these functions with no overhead.
>>>
>>
>> It depends some.
>>
>> One might end up also having a header which serves mostly to detect some
>> compiler and machine specific defines and tune the logic for performance.
>>
>> Say, for GCC, one might want to use inline functions, but for MSVC it
>> might be better to do everything with preprocessor macros, ...
>
> Of course. If you expect to be using the header for multiple different
> compilers, architectures, standards, optimisations, etc., then it is not
> at all unreasonable to have some compiler detection and give specialised
> versions for certain combinations and generic fall-backs for others.
>

Yeah. It is possible to have the code run on generic / unknown
architectures, albeit at a speed penalty.

There is a tradeoff though in that this can add a lot of cruft.


> It would be unexpected for a compiler to support inline functions but
> not be able to handle them as efficiently as a macro. But some
> compilers will handle particular ways of expressing the code more
> efficiently, and in some cases you might have processor-specific
> intrinsics, compiler extensions, or even inline assembly to get the
> optimal code.
>

In some cases, one may need local variables or to not have an argument
be evaluated multiple times, ... These cases favor inline functions.

But, if it is a wrapper over a pointer de-reference or similar, then a
macro typically works better.
This code is single threaded.

The code in question depends on the use of misaligned and overlapping
loads and stores. As a result, changing the relative load/store order
will change the visible results.


I was trying to use pointers declared as:
__attribute__((__may_alias__)) __attribute__((aligned(1)))

But, this did not work in GCC.

However, 'volatile' did give the expected behavior.

The above did work in Clang, however as far as I can tell it is behaving
the same in this case as it did for volatile.


One can also use a bunch of 8 and 16 byte memcpy operations, but as
noted, this is slower than using 'volatile'.



One of the major operations is basically to mimic the behavior of:
unsigned char *cs, *ct, *cte;
ct=dst; cs=dst-dist; cte=dst+len;
while(ct<cte)
{ *ct++=*cs++; }

However, doing it this way is slow enough to have a noticeable adverse
effect on the performance of an LZ decoder. Hence the use of a bunch of
misaligned load/store hackery to try to work around this.

So, the match copying tends to get a little more convoluted, pseudocode:
if(dist>=8)
{ copy data 16B at a time via 8B elements. }
else
{
if(!(dist&(dist-1)))
{
if(dist==1)
{ fill 8B with a single byte; }
else if(dist==2)
{ fill 8B with a 2B pattern; }
else if(dist==4)
{ fill 8B with a 4B pattern; }
Flood-fill target with pattern (power-of-2)
}else
{
if(dist==3)
{ fill 6B with a 3B pattern; step=6; }
else if(dist==5)
{ fill 6B with a 3B pattern; step=5; }
else ...
Flood-fill target with pattern (non-power-of-2)
}
}

And, typically for speed reasons, the match copying is allowed to stomp
memory slightly past the end of the output.
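
As a rough sketch of the dist>=8 fast path (assuming the output buffer
has that slack, and using memcpy for the unaligned 8-byte accesses):

#include <stdint.h>
#include <string.h>

/* Copy an LZ77 match of 'len' bytes from 'dst - dist' to 'dst', dist >= 8.
   May write up to 15 bytes past dst+len, so the caller must leave slack. */
static void lz_copy_match8(unsigned char *dst, size_t dist, size_t len)
{
    unsigned char *src = dst - dist;
    unsigned char *end = dst + len;
    uint64_t t;
    while (dst < end) {
        memcpy(&t, src, 8);      memcpy(dst, &t, 8);
        memcpy(&t, src + 8, 8);  memcpy(dst + 8, &t, 8);
        src += 16;               /* 16 bytes per iteration, 8B elements */
        dst += 16;
    }
}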

But, with raw pointer dereferencing (and no "volatile") something was
messing up here (namely, the actual LZ output would differ from the
expected output).


Without either "volatile" or "__may_alias__", both GCC and Clang were
messing up on this stuff, but this was to be expected.

On MSVC on x64, one may need "__unaligned" otherwise (at "/O2") it may
attempt to vectorize the copy loop.

Both "volatile" and "__unaligned" seem able to ward off the auto
vectorization, but using both does seem to make it "valid" according to
MSDN rules. Either keyword by itself seems able to give the intended
effect, and there is little obvious difference between using one of them
vs using both of them.



For bitstream manipulation, it is similar:
b=(ctx->win>>ctx->pos)&((1<<bits)-1);
ctx->pos+=bits;
while(ctx->pos>=8)
{
ctx->win=(ctx->win>>8)|((*ctx->cs++)<<24);
ctx->pos-=8;
}
Is a fair bit slower than, say:
cs=ctx->cs;
p=ctx->pos;
w=get_u32le(cs);
b=(w>>p)&((1<<bits)-1);
k=p+bits;
ctx->pos=k&7;
ctx->cs=cs+(k>>3);

Or, relatedly, for decoding 4 Huffman symbols:
cs=ctx->cs;
p=ctx->pos;
w=get_u64le(cs);
b=w>>p; c0=htab[b&4095]; p+=c0>>8; c0=c0&255;
b=w>>p; c1=htab[b&4095]; p+=c1>>8; c1=c1&255;
b=w>>p; c2=htab[b&4095]; p+=c2>>8; c2=c2&255;
b=w>>p; c3=htab[b&4095]; p+=c3>>8; c3=c3&255;
c0=c0|(c1<<8); c2=c2|(c3<<8);
ctx->pos=p&7; ctx->cs=cs+(p>>3);
c=c0|(c2<<16);


Where, getu32_le(ptr) might be, MSVC:
(*(volatile __unaligned uint32_t *)(ptr))

Or, what seemed to work on Clang (but not GCC):
(*(uint32_t __attribute__((__may_alias__))
__attribute__((aligned(1))) *)(ptr))

Or, seemed to work on both Clang and GCC:
(*(volatile uint32_t __attribute__((aligned(1))) *)(ptr))


Where, if one puts the attributes before the type-name, Clang gives a
warning that the attributes are ignored; but not if one puts them after
the type name.


For the "generic" fallback, "getu32_le()" and friends can fall back to
something like:
((ptr)[0]|((ptr)[1]<<8)|((ptr)[2]<<16)|((ptr)[3]<<24))

But, this results in a fairly massive drop in performance with MSVC.

However, interestingly, GCC sees much less of a performance impact with
this case.



This sort of hackery may make a pretty big difference in the speed of an
LZ decoder...

MitchAlsup

unread,
Dec 18, 2021, 4:33:46 PM12/18/21
to
Perhaps you should use the Virtual Vector Method where your copy loop
can execute 8 to 16 iterations per clock. This should get rid of the lazy
performance problem.
>
> So, the match copying tends to get a little more convoluted, pseudocode:
> if(dist>=8)
> { copy data 16B at a time via 8B elements. }
> else
> {
> if(!(dist&(dist-1)))
> {
> if(dist==1)
> { fill 8B with a single byte; }
> else if(dist==2)
> { fill 8B with a 2B pattern; }
> else if(dist==4)
> { fill 8B with a 4B pattern; }
> Flood-fill target with pattern (power-of-2)
> }else
> {
> if(dist==3)
> { fill 6B with a 3B pattern; step=6; }
> else if(dist==5)
> { fill 6B with a 3B pattern; step=5; }
> else ...
> Flood-fill target with pattern (non-power-of-2)
> }
> }
<
And obviating the programmer from having to do time-wasting stuff like
the above.
>
> And, typically for speed reasons, the match copying is allowed to stomp
> memory slightly past the end of the output.
>
> But, with raw pointer dereferencing (and no "volatile") something was
> messing up here (namely, the expected LZ output would differ from what
> was the expected output).
>
>
> Without either "volatile" or "__may_alias__", both GCC and Clang were
> messing up on this stuff, but this was to be expected.
>
> On MSVC on x64, one may need "__unaligned" otherwise (at "/O2") it may
> attempt to vectorize the copy loop.
<
You can (CAN) vectorize the loop using VVM without having semantic errors
during execution.
>
> Both "volatile" and "__unaligned" seem able to ward off the auto
> vectorization, but using both does seem to make it "valid" according to
> MSDN rules. Either keyword by itself seems able to give the intended
> effect, and there is little obvious difference between using one of them
> vs using both of them.
>
What you want is vectorization that retains semantic content. CRAY-style
vectors fail at this, VVM succeeds.

BGB

unread,
Dec 18, 2021, 6:09:46 PM12/18/21
to
Possible, but N/A for x86-64...


On BJX2, BGBCC basically behaves similar to MSVC in these areas.


Assuming an unrolled byte-at-a-time copy loop, the fastest possible on
BJX2 would still only be ~ 25MB/s.

Copying a byte at a time via a "while()" loop is closer to 5MB/s.

It is considerably faster on a desktop PC running x86-64, but still a
lot slower than options which copy 8 or 16 bytes at a time.


>>
>> So, the match copying tends to get a little more convoluted, pseudocode:
>> if(dist>=8)
>> { copy data 16B at a time via 8B elements. }
>> else
>> {
>> if(!(dist&(dist-1)))
>> {
>> if(dist==1)
>> { fill 8B with a single byte; }
>> else if(dist==2)
>> { fill 8B with a 2B pattern; }
>> else if(dist==4)
>> { fill 8B with a 4B pattern; }
>> Flood-fill target with pattern (power-of-2)
>> }else
>> {
>> if(dist==3)
>> { fill 6B with a 3B pattern; step=6; }
>> else if(dist==5)
>> { fill 6B with a 3B pattern; step=5; }
>> else ...
>> Flood-fill target with pattern (non-power-of-2)
>> }
>> }
> <
> And obviating the programmer from having to do time-wasting stuff like
> the above.

It would be nicer if the underlying operation were supported by the C
library. As noted, neither "memcpy()" nor "memmove()" implement this
behavior.

I did previously add a "_memlzcpy()" function which had an interface like a
normal "memcpy()" but which had behavior more like what was needed for a
typical LZ77 match copy. I am not aware of any other C libraries doing
this though.


>>
>> And, typically for speed reasons, the match copying is allowed to stomp
>> memory slightly past the end of the output.
>>
>> But, with raw pointer dereferencing (and no "volatile") something was
>> messing up here (namely, the expected LZ output would differ from what
>> was the expected output).
>>
>>
>> Without either "volatile" or "__may_alias__", both GCC and Clang were
>> messing up on this stuff, but this was to be expected.
>>
>> On MSVC on x64, one may need "__unaligned" otherwise (at "/O2") it may
>> attempt to vectorize the copy loop.
> <
> You can (CAN) vectorize the loop using VVM without having semantic errors
> during execution.
>>
>> Both "volatile" and "__unaligned" seem able to ward off the auto
>> vectorization, but using both does seem to make it "valid" according to
>> MSDN rules. Either keyword by itself seems able to give the intended
>> effect, and there is little obvious difference between using one of them
>> vs using both of them.
>>
> What you want is vectorization that retains semantic content. CRAY-style
> vectors fail at this, VVM succeeds.


MSVC seemingly does surprisingly OK at transforming arbitrary code into
a horrible mess of SSE instructions without breaking semantics (it seems
to often put in guard-checks when it does stuff like this).

However, the assumptions that it makes are not always correct (*), and
in the case of LZ match copying, it tends to be detrimental (and
volatile makes it slightly faster in this case).

*: It seems to be rather prone to assume that this is going to be a
giant long-running loop, rather than a short loop being invoked at a
high frequency.


The situation is not exactly the same for BGBCC, which lacks any sort of
auto vectorization (and I have no immediate plans to add this). Volatile
can be detrimental in some cases (it can result in needless register
spills, etc). But, when used as part of a pointer cast, still has
basically the desired effect.


As can be noted, one optimization strategy is "unroll stuff and throw
lots of variables at the problem":
This is often worse than a naive approach on 32-bit x86;
This often works well on x86-64;
This often works extra good on BJX2.
...

In general, 32-bit ARM seems to behave like 32-bit x86 in that it tends
to favor logic which limits the number of in-flight variables.

MitchAlsup

unread,
Dec 18, 2021, 6:50:55 PM12/18/21
to
Using VVM the copy loop should be able to perform at cache-line-access-width
per cycle even when the loop is constructed with LDB and STB (and without
even having to do alias analysis between to and from.)
<
>
> It is considerably faster on a desktop PC running x86-64, but still a
> lot slower than options which copy 8 or 16 bytes at a time.
<
Why not copy 64-bytes per cycle ?
Answer: I don't have a 64-byte wide register!
Retort: You don't need a 64-byte wide register to copy 64 bytes per cycle
using VVM.
<
> >>
> >> So, the match copying tends to get a little more convoluted, pseudocode:
> >> if(dist>=8)
> >> { copy data 16B at a time via 8B elements. }
> >> else
> >> {
> >> if(!(dist&(dist-1)))
> >> {
> >> if(dist==1)
> >> { fill 8B with a single byte; }
> >> else if(dist==2)
> >> { fill 8B with a 2B pattern; }
> >> else if(dist==4)
> >> { fill 8B with a 4B pattern; }
> >> Flood-fill target with pattern (power-of-2)
> >> }else
> >> {
> >> if(dist==3)
> >> { fill 6B with a 3B pattern; step=6; }
> >> else if(dist==5)
> >> { fill 6B with a 3B pattern; step=5; }
> >> else ...
> >> Flood-fill target with pattern (non-power-of-2)
> >> }
> >> }
> > <
> > And obviating the programmer from having to do time-wasting stuff like
> > the above.
<
> It would be nicer if the underlying operation were supported by the C
> library. As noted, neither "memcpy()" nor "memmove()" implement this
> behavior.
<
My 66000 compiler (Brian's) directly converts byte-wide copies into VVM code
and gets as much perf as the HW can muster (typically ½ cache line width
per cycle for low end machines, and at least cache line width on large machines).
<
Look, you have your own architecture and are not restricted to x86. Do better
than you are doing, copy VVM if you like, but do something better--something
that scales from the smallest implementation you can imagine to the largest
and be source code compatible throughout the range.
>
> I did before add a "_memlzcpy()" function which had an interface like a
> normal "memcpy()" but which had behavior more like what was needed for a
> typical LZ77 match copy. I am not aware of any other C libraries doing
> this though.
<
My guess is that if you did VVM, it would give you the vast majority of what
you want.
> >>
> >> And, typically for speed reasons, the match copying is allowed to stomp
> >> memory slightly past the end of the output.
> >>
> >> But, with raw pointer dereferencing (and no "volatile") something was
> >> messing up here (namely, the expected LZ output would differ from what
> >> was the expected output).
> >>
> >>
> >> Without either "volatile" or "__may_alias__", both GCC and Clang were
> >> messing up on this stuff, but this was to be expected.
> >>
> >> On MSVC on x64, one may need "__unaligned" otherwise (at "/O2") it may
> >> attempt to vectorize the copy loop.
> > <
> > You can (CAN) vectorize the loop using VVM without having semantic errors
> > during execution.
> >>
> >> Both "volatile" and "__unaligned" seem able to ward off the auto
> >> vectorization, but using both does seem to make it "valid" according to
> >> MSDN rules. Either keyword by itself seems able to give the intended
> >> effect, and there is little obvious difference between using one of them
> >> vs using both of them.
> >>
> > What you want is vectorization that retains semantic content. CRAY-style
> > vectors fail at this, VVM succeeds.
<
> MSVC seemingly does surprisingly OK at transforming arbitrary code into
> a horrible mess of SSE instructions without breaking semantics (it seems
> to often put in guard-checks when it does stuff like this).
<
Let's see MSVC vectorize this::
<
for( i = 0; i < MAX; i++ )
to[i] = from[MAX-i-1];
<
!
<
VVM can........
>
> However, the assumptions that it makes are not always correct (*), and
> in the case of LZ match copying, it tends to be detrimental (and
> volatile makes it slightly faster in this case).
<
VVM obeys memory aliasing "as if" the code were performed 1 instruction
at a time. Wide register vectors cannot, CRAY style vectors cannot.

David Brown

unread,
Dec 19, 2021, 7:51:45 AM12/19/21
to
On 18/12/2021 20:59, BGB wrote:
> On 12/18/2021 11:26 AM, David Brown wrote:

>> Compilers are free to change load and store orders on the assumption
>> that the current thread of execution is the only thing that ever sees
>> the accesses, unless you specifically say otherwise (with "volatile",
>> atomics, or other methods).  More extreme movements are usually only
>> done at higher levels of optimisation.  But if the code does not work
>> correctly at -O2, you can be pretty confident that your code is wrong.
>> Compiler bugs are not unheard of, of course, but user-code bugs are a
>> /lot/ more common!
>>
>
> This code is single threaded.
>
> The code in question depends on the use of misaligned and overlapping
> loads and stores. As a result, changing the relative load/store order
> will change the visible results.
>

Compilers assume you follow the rules for alignments - and then
overlapping cannot possibly occur (for types smaller than the maximum
alignment sizes). I don't think there is any way to construct objects
that overlap while remaining within the strict rules of C (by which I
mean, without invoking behaviour that is not defined by the standards;
particular compilers might support it, however).

>
> I was trying to use pointers declared as:
>   __attribute__((__may_alias__)) __attribute__((aligned(1)))
>
> But, this did not work in GCC.

No, it will not. "may_alias" does not change alignment rules or allow
overlapping. And the "aligned" attribute is for /increasing/ alignment,
not reducing it. You might have better luck using a struct with the
"packed" attribute - that's the only way in gcc to work with unaligned
data in a fully defined way, I believe.

>
> However, 'volatile' did give the expected behavior.

"volatile" tells the compiler to make the access you gave, regardless of
anything else, and can be used to force unaligned reads and writes
(amongst other things).

>
> The above did work in Clang, however as far as I can tell it is behaving
> the same in this case as it did for volatile.
>
>
> One can also use a bunch of 8 and 16 byte memcpy operations, but as
> noted, this is slower than using 'volatile'.
>

>
>
> For the "generic" fallback, "getu32_le()" and friends can fall back to
> something like:
>   ((ptr)[0]|((ptr)[1]<<8)|((ptr)[2]<<16)|((ptr)[3]<<24))
>
> But, this results in a fairly massive drop in performance with MSVC.
>
> However, interestingly, GCC sees much less of a performance impact with
> this case.
>
>

I recommend playing in <https://godbolt.org> and trying different
combinations here - you can quickly compare the generated results for
different compilers, flags, and code.

Terje Mathisen

unread,
Dec 19, 2021, 2:37:41 PM12/19/21
to
I love those compiler micro-optimizations, they remove the need for
inline asm to handle these things.

MitchAlsup

unread,
Dec 19, 2021, 3:29:09 PM12/19/21
to
On Sunday, December 19, 2021 at 1:37:41 PM UTC-6, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Friday, December 17, 2021 at 2:42:25 PM UTC-6, Terje Mathisen wrote:
> >> I'll typically load 32 or 64 bits surrounding the target are, swap the
> >> bytes if there's an endian change, then use regular shift & mask ops to
> >> get at what I need.
> > <
> > We teach compilers to recognize << followed by >> and convert it into
> > EXT instructions.
> > We teach compilers to recognize & followed by << followed by &~
> > followed by | and convert it into INS instructions (when appropriate).
<
> I love those compiler micro-optimizations, they remove the need for
> inline asm to handle these things.
<
No reason to change the language or library interfaces to access easy
to identify idioms. I modified the M88K compiler to do these in 1984.
All it takes is recognizing a slightly more complicated pattern when
selecting code to be emitted. They typically, also, take a minor amount
of pressure off the register allocator.

BGB

unread,
Dec 19, 2021, 3:58:15 PM12/19/21
to
On 12/19/2021 6:51 AM, David Brown wrote:
> On 18/12/2021 20:59, BGB wrote:
>> On 12/18/2021 11:26 AM, David Brown wrote:
>
>>> Compilers are free to change load and store orders on the assumption
>>> that the current thread of execution is the only thing that ever sees
>>> the accesses, unless you specifically say otherwise (with "volatile",
>>> atomics, or other methods).  More extreme movements are usually only
>>> done at higher levels of optimisation.  But if the code does not work
>>> correctly at -O2, you can be pretty confident that your code is wrong.
>>> Compiler bugs are not unheard of, of course, but user-code bugs are a
>>> /lot/ more common!
>>>
>>
>> This code is single threaded.
>>
>> The code in question depends on the use of misaligned and overlapping
>> loads and stores. As a result, changing the relative load/store order
>> will change the visible results.
>>
>
> Compilers assume you follow the rules for alignments - and then
> overlapping cannot possibly occur (for types smaller than the maximum
> alignment sizes). I don't think there is any way to construct objects
> that overlap while remaining within the strict rules of C (by which I
> mean, without invoking behaviour that is not defined by the standards.
> Particular compilers might support it, however).
>

The goal in this case was seeing what would work in the various compilers.


>>
>> I was trying to use pointers declared as:
>>   __attribute__((__may_alias__)) __attribute__((aligned(1)))
>>
>> But, this did not work in GCC.
>
> No, it will not. "may_alias" does not change alignment rules or allow
> overlapping. And the "aligned" attribute is for /increasing/ alignment,
> not reducing it. You might have better luck using a struct with the
> "packed" attribute - that's the only way in gcc to work with unaligned
> data in a fully defined way, I believe.
>

There is also the "memcpy" option, say:
inline uint32_t getu32_le(void *ptr)
{
uint32_t v;
memcpy(&v, ptr, 4);
return(v);
}

However, despite memcpy and inline being (often) essentially free in
GCC, the end result still ended up roughly 20% slower.


I guess maybe one could use a packed struct like, say:
typedef struct __attribute__((packed)) ui32ps_s {
uint32_t v;
}ui32ps_t;

#define getu32_le(ptr) (((ui32ps_t *)(ptr))->v)

Haven't tried this yet.



The attempt to use 'aligned(1)' was to see if it would behave like
'__unaligned' in some other compilers, but GCC has neither '__unaligned'
nor '__packed', ...

In BGBCC, I followed a rule similar to MSVC, in that either __unaligned
or __packed may allow for unaligned pointers (they are treated as
equivalent in this case); otherwise pointers are assumed to be natively
aligned.
This mostly affects things like whether the compiler is allowed to use
'MOV.X' operations in BJX2, since (unlike other loads/stores) MOV.X
imposes an 8B alignment restriction (*).


*: Mostly because allowing a freely aligned 128-bit load/store would
have a significant adverse effect on the cost of the L1 D$.

Namely one can imagine that each 128-bit L1 line is effectively divided
in half, and one uses a pair of these half-lines to implement an
unaligned load/store (selecting/inserting 64 bits from anywhere within
the 128-bit pair). In this case, the MOV.X operation simply skips this
select/insert stage and operates more directly on the pair of half-lines.

Loads will sign or zero extend the value as requested. Stores will copy
the high-order bits (for insert) from the 'select' results. A new pair
of half-lines will be generated by "reinserting" this 64-bit value back
in at the target pair (which retain any unchanged bits from the select
pair).


In cases where an unaligned 128b load/store is needed, it can be
implemented using a pair of MOV.Q instructions.


By extension, alignment is imposed in a few other areas (Reg: HW, ABI):
SP: 8, 16
PC: 2/4*, 2/4*, (*: 4 is required for inter-ISA branches with RVX)
GBR: 8, 16
TBR: 8, 16


>>
>> However, 'volatile' did give the expected behavior.
>
> "volatile" tells the compiler to make the access you gave, regardless of
> anything else, and can be used to force unaligned reads and writes
> (amongst other things).
>

Yeah.

Volatile seemed to be the winner in these tests.


>>
>> The above did work in Clang, however as far as I can tell it is behaving
>> the same in this case as it did for volatile.
>>
>>
>> One can also use a bunch of 8 and 16 byte memcpy operations, but as
>> noted, this is slower than using 'volatile'.
>>
>
>>
>>
>> For the "generic" fallback, "getu32_le()" and friends can fall back to
>> something like:
>>   ((ptr)[0]|((ptr)[1]<<8)|((ptr)[2]<<16)|((ptr)[3]<<24))
>>
>> But, this results in a fairly massive drop in performance with MSVC.
>>
>> However, interestingly, GCC sees much less of a performance impact with
>> this case.
>>
>>
>
> I recommend playing in <https://godbolt.org> and trying different
> combinations here - you can quickly compare the generated results for
> different compilers, flags, and code.
>

When I looked at it before, GCC was able to transform this pattern into
a plain memory load. MSVC does not seem to do this.


Still leaves an open mystery though of why it would be slower (in GCC)
than the use of 'volatile' or 'memcpy', since it should otherwise be
expected to be the same in this case.

But, in any case, the hit is a lot smaller than the one MSVC takes here.

Stephen Fuld

unread,
Dec 20, 2021, 11:46:40 AM12/20/21
to
A compiler I used to use (it was for Fortran, but this functionality is
not Fortran specific) had a pseudo function named "Bits". It looked
like a function that accepted three arguments, a variable name and a
starting bit number and number of bits, both of which could be a
constant or an integer variable. One of its unique properties is that
it could also be used on the left hand side of an assignment statement.

This feature led to very clear, concise code, that the compiler didn't
have to go through complex algorithms to recognize, and would generate
optimal, inline code. There was sort of a pragma that could optionally
generate checking code for out of range combinations of length and
starting bit number.

So, for example, the source code to move two bits starting at bit 3 of
variable B to starting at bit 4 of variable A would be

Bits (A, 2, 4) = Bits (B, 2, 3)

I always liked this feature, and used it a fair amount.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

unread,
Dec 20, 2021, 12:25:30 PM12/20/21
to
A more livable syntax might be:
<
A<2:4> = B<2:3>

James Van Buskirk

unread,
Dec 20, 2021, 1:21:25 PM12/20/21
to
"Stephen Fuld" wrote in message news:spqc1d$oio$1...@dont-email.me...

> A compiler I used to use (it was for Fortran, but this functionality is
> not Fortran specific) had a pseudo function named "Bits". It looked
> like a function that accepted three arguments, a variable name and a
> starting bit number and number of bits, both of which could be a
> constant or an integer variable. One of its unique properties is that
> it could also be used on the left hand side of an assignment statement.

> This feature led to very clear, concise code, that the compiler didn't
> have to go through a complex algorithms to recognize, and would generate
> optimal, inline code. There was sort of a pragma that could optionally
> generate checking code for out of range combinations of length and
> starting bit number.

> So, for example, the source code to move two bits starting at bit 3 of
> variable B to starting at bit 4 of variable A would be

> Bits (A, 2, 4) = Bits (B, 2, 3)

> I always liked this feature, and used it a fair amount.

Compare
https://gcc.gnu.org/onlinedocs/gfortran/IBITS.html#IBITS
https://gcc.gnu.org/onlinedocs/gfortran/MVBITS.html#MVBITS
which are both elemental.

Stephen Fuld

unread,
Dec 20, 2021, 2:04:58 PM12/20/21
to
I have no problem with that. The main point is to eliminate the need
for specifying some sequence of several shift and mask operations in the
source code. Both your suggestion and mine are clearer and less error
prone for the programmer.

Stephen Fuld

unread,
Dec 20, 2021, 2:12:54 PM12/20/21
to
They seem essentially equivalent. It says that these came in with
Fortran 90. The compiler I was referring to had its roots in a
predecessor with a different name that dated back to at least the early
1970s. I am glad the idea became standardized.

I guess I should ask the question: given that C is probably used for
bit manipulation stuff at least as much as Fortran, how come the C
standard didn't pick up some equivalent?

MitchAlsup

unread,
Dec 20, 2021, 3:06:53 PM12/20/21
to
{Sarcasm mode=on}
<
Having a straightforward way to succinctly express the requirement is
<ahem> Not 'C'.
<
{Sarcasm mode=stays on}........

Marcus

unread,
Dec 21, 2021, 1:21:58 AM12/21/21
to
How about this quick hack in C++... ;-)

It compiles to bit field instructions on ARM:
https://godbolt.org/z/78K3s1cs1

#include <cstring>

// Helper struct.
template <unsigned POS, unsigned LEN, typename T>
struct ipos_helper {
T _1 : POS;
T x : LEN;
};

// Extract bit field.
template <unsigned POS, unsigned LEN, typename T>
T ipos(T x) {
ipos_helper<POS, LEN, T> helper;
static_assert(sizeof(helper) >= sizeof(T));
std::memcpy(&helper, &x, sizeof(T));
return helper.x;
}

unsigned test(unsigned num) {
return ipos<4, 9>(num);
}


/Marcus

Terje Mathisen

unread,
Dec 21, 2021, 6:02:01 AM12/21/21
to
My guess would be that since this effectively requires a fixed bit
ordering within bytes/words, and we already had cpus numbering the bits
from the top as well as up from the bottom, this one was a non-starter
since it would penalize at least one of the alternatives?

Andreas Eder

unread,
Dec 21, 2021, 6:30:04 AM12/21/21
to
On Mo 20 Dez 2021 at 08:46, Stephen Fuld <sf...@alumni.cmu.edu.invalid> wrote:

> A compiler I used to use (it was for Fortran, but this functionality is
> not Fortran specific) had a pseudo function named "Bits". It looked
> like a function that accepted three arguments, a variable name and a
> starting bit number and number of bits, both of which could be a
> constant or an integer variable. One of its unique properties is that
> it could also be used on the left hand side of an assignment statement.
>
> This feature led to very clear, concise code, that the compiler didn't
> have to go through a complex algorithms to recognize, and would generate
> optimal, inline code. There was sort of a pragma that could optionally
> generate checking code for out of range combinations of length and
> starting bit number.
>
> So, for example, the source code to move two bits starting at bit 3 of
> variable B to starting at bit 4 of variable A would be
>
> Bits (A, 2, 4) = Bits (B, 2, 3)
>
> I always liked this feature, and used it a fair amount.

This reminds me of the ldb and dpb functions in Common Lisp:
(http://www.lispworks.com/documentation/HyperSpec/Body/f_dpb.htm
http://www.lispworks.com/documentation/HyperSpec/Body/f_ldb.htm)

'Andreas

Stephen Fuld

unread,
Dec 21, 2021, 11:23:10 AM12/21/21
to
Could be. I think there are two issues here. Big/little where the
actual order in memory is different, and numbering conventions, where
only the names change. Coding a sequence of shifts and masks is also
not portable to big/little endian designs, etc. so this is no different.

As for the numbering conventions, one could have the compiler be
sensitive to different ordering conventions and change the meaning of
the arguments appropriately depending upon target architecture. This
would make coding for any particular architecture consistent with that
architecture's conventions, but probably doesn't help portability, and I
am not sure it is a good idea. But I am not sure that "hiding" the
differences under a sequence of shifts and masks is any better.

EricP

unread,
Dec 21, 2021, 11:44:04 AM12/21/21
to
In C the order in which bit fields in a struct are concatenated
is implementation dependent. It matters for unions.

In Verilog I think almost everything is little endian except,
strangely, the order of bit fields in a struct which is big endian.
(I find it strange because it is such a glaring wart that you'd think
someone would have noticed it during the language design phase.)

MitchAlsup

unread,
Dec 21, 2021, 12:31:06 PM12/21/21
to
Verilog is biEndian (or Endian Free).
<
One can write:
<
a[3:5] = b[5:3];
<
And the bits are reversed !
<
In Verilog bits are just wires (most of the time), wires can be taken in any order:
<
a[3:12] = {b[5:3],b[6:12]};
<
Which reverses 3 bits and copies 7 more. It really helps just to think of them as wires
and assignment as a wiring order.

EricP

unread,
Dec 21, 2021, 2:55:17 PM12/21/21
to
Certain things are hardwired to behave little endian.

Yes for a single field you can declare that the bits are named
from left-low to right-high, or left-high to right-low,
which is endian neutral.

However I got the impression that if you actually tried to use
big endian ordering, certain things would behave anomalously.
e.g. while bits may be named in big endian high to low order,
array elements are always numbered in little endian low to high order.
A unioned array the lowest index always accesses the lsb of an overlay,
irrespective of naming of element bits.

For example

typedef union packed
{
bit [0:7] vecA[10:1]; // 10 big endian bytes
bit vecB[80]; // 80 bits numbered [79:0]
} MyUnion;

the lsb of vecA[1][7:7] would overlay the lsb of vecB[0].

Also while the fields of a struct are packed in left to right or
big endian order, the whole of a struct or union can be accessed
as a bit field with range [n-1:0] which is little endian order.

It looks to me that since the language shows a preference for
little endian that one should stick to it to avoid gotcha's.
And just watch out for the order of struct field packing.


MitchAlsup

unread,
Dec 21, 2021, 6:38:04 PM12/21/21
to
Perhaps it is best said that Verilog is Endian neutral,
and that some things are more naturally expressed in Little Endian.

EricP

unread,
Dec 22, 2021, 1:08:36 PM12/22/21
to
A more straight forward example:
a 16 bit halfword unioned onto a 2 byte array.
In Verilog array[0] overlays the least significant byte of the word as it
does on a little endian machine, irrespective of how the bits are numbered.

With LE bit numbering a bit address trivially chops into a
byte/halfword/word array index and bit offset, to denote the same bit.
With BE bit numbering a bit address selects different bits
depending on whether it is a byte or half or word array index.
That is where the potential gotcha's creep in.


Tim Rentsch

unread,
Dec 22, 2021, 3:24:46 PM12/22/21
to
Not necessarily an answer but an observation: in C no new language
feature is needed, because C has the C preprocessor. It isn't hard
to write a macro that does the necessary shifting and masking (and
these days compilers are mostly smart enough to recognize the common
patterns and translate them efficiently and in machine-specific
ways).
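
For instance, something like this (just a sketch; the macro names are
made up, and len is assumed to be between 1 and 63):

/* Extract LEN bits of X starting at bit POS (bit 0 = least significant). */
#define GETBITS(x, pos, len) \
    (((x) >> (pos)) & ((1ULL << (len)) - 1))

/* Replace those bits with the low bits of V. */
#define PUTBITS(x, pos, len, v) \
    (((x) & ~(((1ULL << (len)) - 1) << (pos))) | \
     (((unsigned long long)(v) & ((1ULL << (len)) - 1)) << (pos)))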

Stephen Fuld

unread,
Dec 23, 2021, 12:07:45 PM12/23/21
to
While you are right, of course, about the preprocessor, I guess it comes
down to me having a different philosophy of language design, than what
Mitch expressed in his (sarcastic) comment.

> Having a straightforward way to succinctly express the requirement is
> <ahem> Not 'C'.

The problem with the preprocessor approach is that everyone will code
their own macro, possibly with different syntax, so I have to learn what
you did, then if I move to a different project, what they did etc. That
is the beauty of a standard. (Unless you are talking about having the
preprocessor macro standardized.)

Compare with, for example, integer division. It is part of the
standard, even though fairly rarely used. If bit field extraction and
insertion are reasonably well used (and I am guessing they are - at
least as much as they are used in Fortran, though probably less than
division), then I feel they should be standardized to ease the
programmer's job.

Guillaume

unread,
Dec 23, 2021, 2:08:39 PM12/23/21
to
On 23/12/2021 at 18:07, Stephen Fuld wrote:
> Compare with, for example, integer division.  It is part of the
> standard, even though fairly rarely used.

Sorry, I'm not sure I got it? Integer division is rarely used in C? Can
you elaborate?

> If bit field extraction and
> insertion are reasonably well used (and I am guessing they are - at
> least as much as they are used in Fortran, though probably less than
> division), then I feel they should be standardized to ease the
> programmer's job.

It's probably up for debate, of course, like any new "feature" that
could be added to C. For instance, stdint was added to C99... but this
was a bit different: integer widths were mostly implementation-defined
before that. Standardizing things that are commonly used and rely on
implementation-defined stuff makes sense.

But "bit manipulation" in general is something that can be implemented,
off the top of my head, in a portable way, using existing operators,
without risking incorrect behavior if you use them properly. So adding
bit manipulation functions to the standard would not add tremendous
value. Sure it would marginally help developers, but as far as I've
seen, the process of adding a new feature in the C standard usually
implies that it can't be done portably otherwise. Again, extracting bits
can be done portably as far as I see it. (Don't hesitate to prove me
wrong with an example!)

MitchAlsup

unread,
Dec 23, 2021, 2:37:41 PM12/23/21
to
We have to separate this into two (2) categories: static bit fields which
C already has, and dynamic bit fields which C does not have.
<
Thus, we have already taught the compilers how to recognize these patterns
and generate bit field codes. Static pattern recognition is easier than dynamic;
but if you can recognize the dynamic cases, you automatically recognize the
static. I first did this in 1983 for DENELCORE and I believe that others got there
before I did.
<
The only benefit to developers would be some kind of syntactic sugar to make
bit fields easier to specify::
<
a<3:17> = b<9:23>;
<
Macros seem to fail in that they cannot do syntactic sugar things.

Thomas Koenig

unread,
Dec 23, 2021, 3:47:40 PM12/23/21
to
MitchAlsup <Mitch...@aol.com> schrieb:

> The only benefit to developers would be some kind of syntactic sugar to make
> bit fields easier to specify::
><
> a<3:17> = b<9:23>;
><
> Macros seem to fail in that they cannot do syntactic sugar things.

Macros in C can be used either for syntactic sugar or for syntactic
cyanide.

The original Bourne shell is an example. I'll leave it up to
the reader to decide if the C source code

IF wdval==NL
THEN IF flg&NLFLG
THEN wdval=';'; chkpr(NL);
FI
ELIF i==0 ANDF (flg&MTFLG)==0
THEN synbad();
FI

is the former or the latter.

Guillaume

unread,
Dec 23, 2021, 6:10:07 PM12/23/21
to
On 23/12/2021 at 20:37, MitchAlsup wrote:
> The only benefit to developers would be some kind of syntactic sugar to make
> bit fields easier to specify::
> <
> a<3:17> = b<9:23>;


Oh, I perfectly get that, and I agree this notation would be quite nice.

> Macros seem to fail in that they cannot do syntactic sugar things.

Not this kind, indeed. But the whole question regarding the standard is,
is it really worth it to make additions to it just for some syntactic
sugar? This is not the spirit of C.

But in that same vein, I would have a number of other notations I would
like to have in C. A lot more than just this. Thing is, who knows where
that would lead the language.