IME, in terms of code density, x86-64 seems to be pretty weak. Though
it varies some; for example, "gcc -Os" on Linux does somewhat better
than MSVC.
In my own comparisons, 32-bit x86 and Thumb2 tend to do a lot better.
A64 seems to do a bit worse here than Thumb2.
As a small example, I had recently hacked together a small voxel based
3D engine along vaguely similar lines to Minecraft Classic for the BJX2
ISA (*1).
A build of the engine for x86-64 is 838K (via MSVC), whereas the BJX2
build is 175K. Not strictly apples-to-apples, but still...
*1: Its renderer basically sweeps across the screen doing ray-casts,
building up a list of any blocks hit by a ray-cast, and then drawing the
list of blocks (via software-rasterized OpenGL). It has minimal overdraw
(because a raycast will not pass through a wall), but only really works
effectively at small draw distances.
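A minimal sketch of the per-ray inner loop described above, written as a 2D DDA walk over a small voxel grid (the grid, names, and the 2D simplification are all made up for illustration; the actual engine works in 3D per screen position). Each ray records every cell it enters until it hits a solid block, which is why there is minimal overdraw:

```c
#include <math.h>

#define GRID_W 8
#define GRID_H 8

/* 1 = solid block, 0 = empty; wall segment on row 0 at x=5 */
static const int grid[GRID_H][GRID_W] = {
    {0,0,0,0,0,1,0,0},
    {0,0,0,0,0,0,0,0},
    {0,0,0,0,0,0,0,0},
    {0,0,0,0,0,0,0,0},
    {0,0,0,0,0,0,0,0},
    {0,0,0,0,0,0,0,0},
    {0,0,0,0,0,0,0,0},
    {1,1,1,1,1,1,1,1},
};

/* Walks the grid from (ox,oy) along direction (dx,dy); fills
 * hits_x/hits_y with the cells visited (the "list of blocks hit")
 * and returns the count. Stops at the first solid cell. */
int raycast(double ox, double oy, double dx, double dy,
            int hits_x[], int hits_y[], int max_hits)
{
    int cx = (int)ox, cy = (int)oy;
    int step_x = (dx < 0) ? -1 : 1;
    int step_y = (dy < 0) ? -1 : 1;
    double tdx = (dx != 0) ? fabs(1.0 / dx) : 1e30;  /* t-advance per cell in x */
    double tdy = (dy != 0) ? fabs(1.0 / dy) : 1e30;
    double tx = (dx < 0) ? (ox - cx) * tdx : (cx + 1 - ox) * tdx;
    double ty = (dy < 0) ? (oy - cy) * tdy : (cy + 1 - oy) * tdy;
    int n = 0;

    while (n < max_hits && cx >= 0 && cx < GRID_W && cy >= 0 && cy < GRID_H) {
        hits_x[n] = cx; hits_y[n] = cy; n++;
        if (grid[cy][cx])   /* solid: the ray stops, nothing behind is drawn */
            break;
        if (tx < ty) { tx += tdx; cx += step_x; }
        else         { ty += tdy; cy += step_y; }
    }
    return n;
}
```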
Its performance still manages to be somehow less awful than I originally
imagined (framerates are a little better than Quake, albeit at a 24
block draw-distance). Ended up using a color-fill sky, mostly as this
improves framerate somewhat compared with drawing a skybox.
This is still with me discovering and occasionally fixing "crazy bad"
compiler bugs, eg:
Turns out the compiler was very-frequently trying to cast-convert
operands of binary operators to the destination type even when they were
the same (resulting in a lot of extra register MOVs, spills, ...).
Eg, if you did something like:
int a, b, c;
c=a+b;
It often tended to compile it as if it were:
c=(int)a+(int)b;
Which at the ASM level would, instead of, say:
ADDS.L R8, R9, R14
Result in something like:
MOV R8, R25
MOV R9, R28
ADDS.L R25, R28, R14
And would also result in higher register pressure and a larger number of
spills.
Fixing this bug gave a roughly 4% reduction in the size of binaries, and
a roughly 20% increase in performance for Doom and similar. This also
caused Dhrystone score to increase from ~51.3k to ~57.1k.
It seemed this was also related to a lot of cases where, say:
c=a+imm;
Was resulting in things like:
MOV Imm, R9
MOV R12, R7
ADDS.L R7, R9, R13
Rather than, say:
ADDS.L R12, Imm, R13
...
Then, relatedly, I noted that, eg:
y=x&255;
Was being compiled sorta like:
MOV R11, R7
EXTU.B R7, R7
MOV R7, R28
MOV R28, R14
Vs, say:
EXTU.B R11, R14
Turns out this was stumbling on some logic for a "stale" code-path,
where early on, the compiler would handle operators more like:
Allocate scratch registers;
Load frame variables into scratch registers;
Apply operator to scratch registers;
Store result back to call frame;
Free said scratch registers.
But, this was later replaced with:
Fetch the variables as registers;
Operate on these registers;
Release the registers.
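As a toy sketch of the difference between the two strategies (made-up helper names, not the actual compiler's code; "emitting" here just counts instructions, with the intended assembly text as documentation), consider what each path produces for a single `c=a+b;`:

```c
/* Emission is modeled as a bare counter; the strings show the
 * hypothetical BJX2-style instructions each step would produce. */
typedef struct { int n_insns; } Emit;

static void emit(Emit *e, const char *txt) { (void)txt; e->n_insns++; }

/* Old ("stale") path: load frame variables into scratch registers,
 * apply the operator, store the result back to the call frame. */
void add_old_path(Emit *e) {
    emit(e, "MOV.L (SP,a), R16");    /* load a into a scratch reg */
    emit(e, "MOV.L (SP,b), R17");    /* load b into a scratch reg */
    emit(e, "ADDS.L R16, R17, R16"); /* operate on scratch regs */
    emit(e, "MOV.L R16, (SP,c)");    /* store result back to frame */
}

/* Newer path: a, b, and c already live in registers, so the whole
 * statement is a single instruction. */
void add_new_path(Emit *e) {
    emit(e, "ADDS.L R8, R9, R14");
}
```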
After I had switched over, trying to load/store a frame variable in
this way would typically result in register MOVs rather than an actual
memory load/store. These older paths have not been entirely eliminated,
though.
But, yeah, fixing these appears to have slightly reduced the level of
"general awfulness" in my C compiler output.
Then added another slight compiler tweak which got it up to ~57.9k
(namely, caching and reusing struct-field loads in certain cases).
Was able to push it up to ~59.0k by assuming less-conservative
semantics ("strict aliasing"), but I decided against enabling this by
default as it seems unsafe. This mostly affects the conditions under
which the cached struct field is discarded.
At present, it has certain restrictions:
* Does not cross a basic-block boundary;
* Discarded if either of the cached variables is modified;
* Discarded if any sort of explicit memory store happens;
* ...
But, what it will do, is compile an expression like:
y=foo->x*foo->x;
As if it were:
t0=foo->x;
y=t0*t0;
Though, this optimization does not appear to have any real effect on
Doom and similar.
It is possible a similar trick could be used for array loads or pointer
derefs.
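Written out at the C source level, the analogous transformation for an array load would look something like this (illustrative only; function names made up, and as noted the compiler does not currently do this):

```c
/* Without the optimization: 'arr[i]' is loaded twice. */
int square_elem_naive(const int *arr, int i) {
    return arr[i] * arr[i];
}

/* With the same caching trick applied: one load, reused from a
 * register, exactly like the struct-field case above. */
int square_elem_cached(const int *arr, int i) {
    int t0 = arr[i];    /* single load, cached */
    return t0 * t0;
}
```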
I also went and recently optionally re-added the "FMOV.S" instruction
(Memory Load/Store combined with a Single<->Double conversion), since
this should be able to help some for code which works with
single-precision floating point values (avoids some common penalty cases).
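For context, the kind of code this helps is C that keeps single-precision values in memory but does arithmetic at double precision, so every element access is a load plus a widening conversion (example function and data are made up):

```c
/* Each loop iteration loads a float from memory and widens it to
 * double for the accumulate. Without a combined load+convert
 * instruction like FMOV.S, the conversion costs a separate
 * instruction per element. */
double sum_floats_as_double(const float *v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += v[i];   /* load single, convert to double, add */
    return s;
}
```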
...
>>> A variable width instruction set can support chaining today by just adding
>>> the instructions, I am perplexed as to why no one has.
>> <
>> Why don't you give it a go and see what comes out ?
>
> Chaining opcodes is complex homework that every company should have taken a
> look at, including you, results should be somewhere on the internet.
>
> Getting the RISC guys to go variable width was worse than pulling teeth.
> Threats of firings and resignations were involved at ARM, and the MIPS
> founder did fire people for such suggestions, though most were weeded out
> at hiring interviews leading to brain dead group think that killed the
> company when the market changed.
>
> Adding a chaining register dependency is maybe 10 times worse in these
> peoples minds.
>
It is a balance, somewhat.
16/32, by looking at a few bits, is OK.
Decoding a bundle based on also looking at a few bits and daisy-chaining
is also OK.
Fully variable length encodings which depend on looking at lots of
different bits are less OK.
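As a concrete illustration of the "look at a few bits" case, here is a sketch using the RISC-V "C" extension rule as the example (low two bits of the first halfword equal to 11 means 32-bit, anything else means 16-bit); BJX2's actual encoding differs, this just shows the shape of the check:

```c
#include <stdint.h>

/* Length determination from a few bits: the decoder only needs the
 * low two bits of the first halfword, so it can scan halfword by
 * halfword without examining the rest of the instruction. This is
 * the RISC-V RVC convention, used here purely as an example rule. */
int insn_length_bytes(uint16_t first_halfword) {
    return ((first_halfword & 0x3) == 0x3) ? 4 : 2;
}
```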
>>> A new architecture needs a hook to get noticed and dominating instruction
>>> density is one way to get that notice.
>> <
>> My guess is that lower context switch overhead would garner more wins
>> than instruction density; for example, an ADA call to an entry accept point
>> in a different address space costing only 12 cycles.
>
> Sounds good.
>
>> <
>>>> RISC only made sense for a decade back in the ancient history of the
>>>> 1980’s.
>> <
>> RISC made sense in the brief interval when 32-bit ISAs were too complicated
>> to all be on 1 chip. By shedding the area of the microcode, one got the space
>> to build pipeline registers; and instead of 1 instruction every 4-6 cycles, one
>> got 1 instructions every clock (less cache miss latency).
>
> Agree.
>
FWIW: In theory, my BJX2 core can do ~ 2 or 3 instructions per cycle.
Though, actual "real-world" results tend to be closer to 0.3 to 0.5 ...
A lot of this is due to cache misses, interlock penalties, and my
compiler mostly failing to bundle instructions.
Can generally get better results with ASM though.
>>>> Today if I wanted to build a better 16 or 32 bit processor the first step
>>>> would be to find what micro coded instructions I could add to reduce
>>>> instruction density, and thus win the lowest cost war.
>> <
>> In My case (My 66000) the biggest code density benefit was in creating
>> ENTER and EXIT instructions, second best was giving every instruction
>> access to any width immediate.
>> <
>> Secondarily, I doubt the 16-bit market is in the search for a new architecture.
>
> The 16 bit market has the worst choices to pick from, the problem is that
> these devices are cheap, so everyone ignores this market. Definition of
> opportunity.
>
> In the 32 bit market you have to compete with RISC-V which is free. Sure it
> is crap compared to the new architectures discussed here, but it’s free.
>
> Pick your poison. ;)
>
For 16-bit, one mostly wants "as cheap as possible".
Though, at least for off-the-shelf microcontrollers, it is hard to
really beat out something like "just use an MSP430 or similar".
One can design a "better" 16-bit ISA, and then, say, run it on an ICE40
or similar, but then unless one has a strong use-case to justify needing
FPGA logic, the ICE40 costs more and is more complicated to use than an
MSP430.
Then there is a certain amount of "use a Cortex-M but treat it like it
is a 16-bit ISA".
Or, if one does custom silicon, how do they get enough "volume" and
"momentum" to make it cost-effective vs existing options? ...
Then again, I am doing my existing project more because I found it
interesting than because it necessarily makes sense.
For many things, a Cortex-M would be both faster and cheaper.
Though, it isn't 1:1, because while a (higher end) Cortex-M dev-board
can do a pretty decent job running Doom or similar, if trying to do
something like an OpenGL style software rasterizer or similar on it, it
falls on its face.
Seemingly, the Thumb ISA does rather poorly on workloads that end up
consisting almost entirely of memory loads and stores.