
Concertina II Progress


Quadibloc

Nov 8, 2023, 10:28:56 PM
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at

http://www.quadibloc.com/arch/ct17int.htm

As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.

I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.

So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.

This has now been dropped. Since I managed to get the normal (unaligned)
memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without compromises
in the basic instruction set, it wasn't needed to have multiple instruction
formats.

I had to change the instructions longer than 32 bits to get them in the
basic instruction format, so now they're less dense.

Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).

The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.

John Savard

BGB

Nov 9, 2023, 1:45:53 AM
On 11/8/2023 3:33 PM, Quadibloc wrote:
> Some progress has been made in advancing a small step towards sanity
> in the description of the Concertina II architecture described at
>
> http://www.quadibloc.com/arch/ct17int.htm
>
> As Mitch Alsup has rightly noted, I want to have my cake and eat it
> too. I want an instruction format that is quick to fetch and decode,
> like a RISC format. I want RISC-like banks of 32 registers, and I
> want the CISC-like addressing modes of the IBM System/360, but with
> 16-bit displacements, not 12-bit displacements.
>

Ironically, I am getting slightly better reach on average with scaled
9-bit (and 10-bit) displacements than RISC-V gets with 12 bits...

Say:
DWORD:
12s, Unscaled: +/- 2K
9u, 4B Scale : + 2K
10s, 4B Scale: +/- 2K (XG2)
QWORD:
12s, Unscaled: +/- 2K
9u, 8B Scale : + 4K
10s, 8B Scale: +/- 4K (XG2)

It was a pretty tight call between 10s and 10u, but 10s won out by a
slight margin mostly because the majority of structs and stack-frames
tend to be smaller than 4K (but, does create an incentive to use larger
storage formats for on-stack storage).

Though, for integer immediate instructions, RISC-V would have a slight
advantage. Where, say, roughly 9% of 3R integer immediate values miss
with the existing Imm9u/Imm9n scheme; but the sliver of "Misses with 9
bits, but would hit with 12 bits", is relatively small (most of the
"miss" cases are much larger constants).

However, a fair chunk of these "miss" cases, could be handled with a
bit-set/bit-clear instruction, say:
y=x|0x02000000;
z=x&0xFDFFFFFF;
Turning into, say:
BIS R4, 25, R6
BIC R4, 25, R7

Unclear if this case is quite common enough to justify adding these
instructions though (granted, a case could be made for them).
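As a rough sketch in C of the sort of pattern matching involved (the
BIS/BIC semantics here are just the obvious reading of the example
above, Rd = Rs | (1<<imm) and Rd = Rs & ~(1<<imm); the helper names are
made up):

  #include <stdint.h>

  /* A constant is BIS/BIC material if it (or its complement) has
     exactly one bit set; anything else still needs a full constant
     load. */
  static int is_single_bit(uint64_t m) {
      return m != 0 && (m & (m - 1)) == 0;
  }

  /* Bit index for a BIS-style OR mask, or -1 on a miss.
     Eg: 0x02000000 -> 25. */
  static int match_bis(uint64_t imm) {
      int bit = 0;
      if (!is_single_bit(imm))
          return -1;
      while (!(imm & 1)) { imm >>= 1; bit++; }
      return bit;
  }

  /* BIC is the same test applied to the complement of the AND mask.
     Eg: 0xFFFFFFFFFDFFFFFF -> 25 (the 32-bit example above, widened
     to a 64-bit register). */
  static int match_bic(uint64_t imm) {
      return match_bis(~imm);
  }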


However, a few cases do typically need larger displacements:
PC relative, such as branches.
GBR relative, namely constant loads.


For PC relative, 20 bits is "mostly enough", but one program has hit the
20-bit limit (+/- 1MB). Recently, via a tweak, in current forms of the
ISA, the effective branch-displacement limit (for a 32-bit instruction
form) has been increased to 23 bits (+/- 8MB).
Baseline+XGPR: Unconditional BRA and BSR only.
Conditional branches still limited to 20 bits.
XG2: Also includes conditional branches.

In these cases, it was mostly because the bits being used to extend the
GPRs to 6 bits were N/A for their original purpose with branch ops, so
they could be repurposed for the displacement. The main alternatives
would have been 22 bits plus an alternate link register, or a 3-bit LR
field; however, the cost of supporting these would have been higher
than that of simply reassigning the bits to make the displacement
bigger.

Potentially a similar role could have been served by a conjoined "MOV
LR, R1 | BSR Disp" instruction (and/or allowing "MOV LR, R1" in Lane 2
as a special case for this, even if it would not otherwise be allowed
within the ISA rules). Though, would defeat the point if this encoding
foils the branch predictor.



Recently, had ended up adding some Disp11s Compare-with-Zero branches,
mostly as these branches turn out to be useful (in the face of 2-cycle
CMPxx), and 8 bits "wasn't quite enough". Say, Disp11s can cover a much
bigger if/else block or loop body (+/- 2K) than Disp8s (+/- 256B).


For GBR Relative:
The default 9-bit displacement was Byte scaled (for "reasons");
But, a 512B range isn't terribly useful;
Later forms ended up with Disp10u Scaled:
This gives 4K or 8K of range (in Baseline)
This increases to 8K and 16K in XG2.


If the compiler sorts primitive global variables by descending-usage
(and emits the top N specially, at the start of ".data"), then the
Scaled GBR cases can access a majority of the global variables (around
75-80% with a scaled 10-bit displacement).
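As a sketch in C of such a pass (illustrative only, not from BGBCC):

  #include <stdlib.h>

  typedef struct {
      const char *name;
      int         use_count;  /* static reference count */
      int         size;       /* size in bytes */
  } GlobalVar;

  static int by_usage_desc(const void *a, const void *b) {
      const GlobalVar *ga = a, *gb = b;
      return gb->use_count - ga->use_count;
  }

  /* Sort globals by descending usage; those that fit inside the scaled
     GBR window go at the start of ".data", the rest fall back to the
     Jumbo / 2-op / 3-op forms listed below. */
  static int layout_globals(GlobalVar *vars, int n, int window_bytes) {
      int i, off = 0;
      qsort(vars, n, sizeof *vars, by_usage_desc);
      for (i = 0; i < n; i++) {
          if (off + vars[i].size > window_bytes)
              break;
          off += vars[i].size;
      }
      return i;  /* count of vars reachable with the short form */
  }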

Effectively, the remaining 20-25% or so need to be handled as one of:
Jumbo Disp33s (if Jumbo prefixes are available, most profiles);
2-op Disp25s (no jumbo, '.data'+'.bss' less than 16MB).
3-op Disp33s (else).


Though, as with the stack frames, these instructions do create an
incentive to effectively promote any small global variables to a larger
storage type (such as 'char' or 'short' to 'int'); just with implicit
sign (or zero) extensions to preserve the expected behavior of the
smaller type (though, strictly speaking, only zero-extensions would be
required by the C standard, given signed overflow is technically UB; but
there would be something "deeply wrong" with a 'char' variable being
able to hold, say, -4495213, or similar).

Though, does mean for normal variables, "just use int or similar" is
typically faster (say, because there are dedicated 32-bit sign and zero
extending forms of some of the common ALU ops, but not for 8 or 16 bit
cases).


A Disp16u case could maybe reach 256K or 512K, which could cover much of
a combined data+bss section. While in theory this could be better, to
make effective use of this would require effectively folding much of
".bss" into ".data", which is not such a good thing for the program
loader (as opposed to merely folding the top N most-used variables into
".data").

Then again, uninitialized global arrays could probably still be left in
".bss", which tend to be the main "bulking factor" for this section (as
opposed to normal variables).




> I want memory-reference instructions to still fit in 32 bits, despite
> asking for so much more capacity.
>

Yeah.

If you want a Load/Store to have two 5 bit registers and a 16-bit
displacement, only 6 bits are left in a 32-bit instruction word. This
is not a whole lot...

For a full set of Load/Store ops, this is 4 bits;
For a set of basic ALU ops, this is another 3 bits.

So, just for Load/Store and basic ALU ops, half the encoding space is
gone...
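
Worked out, using just the figures above: 32 - (5+5) - 16 leaves a
6-bit major opcode, i.e. 64 slots. A full Load/Store set at 4 bits eats
16 of those slots (1/4 of the space), and a basic ALU set at 3 bits
eats another 8 (1/8), so 24 of the 64 slots go just to those two
groups.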

Would it be worth it?...



> So what I had done was, after squeezing as much as I could into a basic
> instruction format, I provided for switching into alternate instruction
> formats which made different compromises by using the block headers.
>
> This has now been dropped. Since I managed to get the normal (unaligned)
> memory-reference instruction squeezed into so much less opcode space that
> I also had room for the aligned memory-reference format without compromises
> in the basic instruction set, it wasn't needed to have multiple instruction
> formats.
>
> I had to change the instructions longer than 32 bits to get them in the
> basic instruction format, so now they're less dense.
>
> Block structure is still used, but now for only the two things it's
> actually needed for: reserving part of a block as unused for the
> pseudo-immediates, and for VLIW features (explicitly indicating
> parallelism, and instruction predication).
>
> The ISA is still tremendously complicated, since I've put room in it for
> a large assortment of instructions of all kinds, but I think it's
> definitely made a significant stride towards sanity.
>

Such is a long standing issue...


I am also annoyed sometimes at how complicated my design has gotten.
Still, it is within reason, and not too far outside the scope of many
existing RISC's.

But, as noted, the reason XG2 exists as-is was sort of a compromise:
I couldn't come up with any encoding which could actually give
everything I wanted, and the "most practical" option was effectively to
dust off an idea I had originally rejected:
Having an alternate encoding which dropped 16-bit ops in favor of
reusing these bits for more GPRs.


At first glance, RISC-V seems cleaner and simpler, but this falls on its
face once one goes outside the scope of RV64IM or similar.

And, it isn't tempting when, at least from my POV, RV64 seems "less
good" than what I have already (others may disagree; but at least to me,
some parts of RISC-V's design seem to me like kind of a trash fire).

The main tempting thing the RV64 has is that, maybe, if one goes and
implements RV64GC and clones a bunch of SiFive's hardware interfaces,
then potentially one can run a mainline Linux on it.

There have apparently been some people that have gotten NOMMU Linux
working on RV32IM targets, which is possible (and, ironically, seemingly
basing these on the SuperH branch in the Linux kernel from what I had
seen...).


Seemingly, AMD/Xilinx is jumping over from MicroBlaze to an RV32
variant. But, granted, RV32 isn't too far from what MicroBlaze is
typically used for, so not really a huge stretch.

I sometimes wonder if maybe I would be better off jumping to RV, but
then I end up seeing examples where cores running at somewhat higher
clock speeds still manage to deliver relatively poor framerates in Doom.


Like, as-is, my MIPS scores are kinda weak, but I am still getting
around 30 fps in Doom at around 20-24 MIPS.

RV64IM seemingly needs significantly higher MIPS to get similar
framerates in Doom.

Say, for Doom:
BJX2 needs ~ 800k instructions / frame;
RV64IM seemingly needs nearly 2 million instructions / frame.

Not entirely sure what all is going on, but I have my suspicions.

Though, it does seem to be the inverse situation with Dhrystone.

Say:
BJX2: around 1.3 DMIPS per BJX2 instruction;
RV64: around 3.8 DMIPS per RV64 instruction.

Though, I can note that there seems to be "something weird" with
Dhrystone and GCC (in multiple scenarios, GCC gives Dhrystone scores
that are significantly above what could be "reasonably expected", or
which agree with the scores given by other compilers, seemingly as-if it
is optimizing away a big chunk of the benchmark...).

But, these results don't typically extend to other programs (where
scores are typically much closer together).


Actually, I have noted that if comparing BGBCC with MSVC, and BJX2 with
my Ryzen, performance seems to scale pretty close to linearly relative
to clock-speed, albeit with some outliers.

There are cases where deviation has been noted:
Speed differences for TKRA-GL's software rasterizer backend are smaller
than the difference in clock-speed (74x clock-speed delta; 20x fill-rate
delta);
And cases where it is bigger: The performance delta for things like LZ4
decompression or some of my image codecs is somewhat larger than the
clock-speed delta (say: 74x clock-speed delta, 115x performance delta, *1).


*1: Though, LZ4 still operates near memcpy() speed in both cases; issue
is mostly that, relative to MHz, my BJX2 core has comparably slower
memory access.

Albeit somehow, this trend reverses for my early 2000s laptop, which has
slower RAM access. However, the SO-DIMM is 4x the width (64b vs 16b),
and 133MHz vs 50MHz; and this leads to a theoretical 10.64x ratio, which
isn't too far off from the observed memcpy() performance of the laptop.

So, laptop has 10.64x faster RAM, relative to 28x more MHz.


Whereas, say, my Ryzen has 2.64x more MHz (3.7 vs 1.4), but around 40x
more memory bandwidth (12.7x for single-thread memcpy).



Well, and if I did jump over to RV64, it would render much of what I
am doing entirely moot.

I *could* do a dedicated RV64 core, but would be unlikely to make it
"notable" enough to be worthwhile.

So, it seems like my options are either:
Continue on doing stuff mostly as is;
Drop it and probably go off to doing something else entirely.

...




But, don't have much else better to be doing, considering the typically
"meh" response to most of my 3D engine attempts. And my general
lackluster skills towards most types of "creative" endeavors (I suspect
"affective alexithymia" probably doesn't help too much for artistic
expression).

Well, and I have also recently noted other oddities, for example:
It seems I may have "reverse slope hearing loss", and my hearing is
seemingly notably poor for sounds much lower than about 1.5 or 2kHz
(lower-frequency sine waves are nearly inaudible, but I can still hear
square/triangle/sawtooth waves well; most of what I perceive as
low-frequency sounds seemingly being based on higher-frequency harmonics
of those sounds).

So, say:
2kHz..4kHz, loud, heard easily;
4kHz..8kHz, also heard readily;
8..15kHz, fades away and disappears.
But, OTOH, for sine waves:
1kHz: much quieter than 2kHz
500Hz: fairly mild at full volume
250Hz: relatively quiet
125Hz: barely audible.


But, for sounds much under around 200Hz, I can feel the vibrations, and
can associate these with sound (but, this effect is not localized to
ears, also works with hands and similar; this effect seems strongest at
around 50-100 Hz, but has a lower range of around 6-8Hz, below this
point, feeling becomes less sensitive to it, but visual perception can
take over at this point).


I can take audio and apply a fairly aggressive 2kHz high-pass filter
(say, -48db per octave, applied several times), and for the most part it
doesn't sound that much different, though does sound a little more
tinny. This "tinny" effect is reduced with a 1kHz high-pass filter.

Most of what I had perceived as low-frequency sounds are still present
even after the filtering (and while entirely absent in a spectrum plot).
Zooming in generally shows patterns of higher frequency vibrations
following similar patterns to the low-frequency vibrations, which
seemingly I perceive "as" the low-frequency vibration.


And, in all this, I hadn't noticed that anything was amiss until looking
into it for other reasons.



I am left to wonder if some of this could be related to my preference
for the sound of ADPCM compression over that of MP3 at lower quality
levels (low bitrate MP3 sounds particularly awful, whereas ADPCM tends
to fare better; but seemingly other people disagree).


Does possibly explain some other past difficulties:
I can make a noise and hear the walls within a room;
But, trying to hit a metal tank to determine how much sand was in the
tank by hearing, was quite a bit more difficult (best I could do was hit
the tank, and then try to hear what parts of the tank had reduced echo;
but results were pretty mixed as the sand level did not significantly
change the echoes).

Apparently, it turns out, people were listening for "thud" vs "not
thud", but like, I couldn't really hear this part, and wasn't even
really aware there should be a "thud" (or even really what a "thud"
sounds like apart from the effects of, say, something hitting a chunk of
wood; hitting a sand-filled steel tank with a rubber mallet was nearly
silent, but, knuckles or tapping it with a screwdriver was easier to
hear, ...).


Well, also can't really understand what anyone is saying over the phone
(as the phone reduces everything to difficult to understand muffled noises).

Or, like the sound effects in Wolfenstein 3D theoretically being voice
clips saying stuff, but coming across more as things like "aaaa uunn" or
"aaaauuuu" or "uu aa uu" or similar, owing to the poor audio quality.

Well, and my past failures to achieve any kind of intelligibility in
past experiments messing with formant synthesis.

And some experiments with vocoder-like designs, noting that I could
seemingly discard pretty much everything much below 500Hz or 1kHz
without much ill effect; but theoretically there is "relevant stuff" in
these frequency ranges. Didn't really think much of it at the time (it
seemed like all of this was a "bass frequency" where the combined
amplitude of everything could be averaged together and treated like a
single channel).

Had noted that, one thing that did sort of work, was, say:
Split the audio into 32 frequency bands;
Pick the top 2 or 3 bands, ignoring low-frequency or adjacent bands;
Say, anything below 1kHz is ignored.
Record the band number and relative volume.

Then, regenerate waveforms at each of these bands with the measured
volume (along with alternate versions spread across different octaves;
it worked better if higher power-of-2 frequencies were also synthesized,
albeit at lower intensities). Get back "mostly intelligible" speech.

IIRC, had mostly used 32 bands spread across 2 octaves (say, 1-2 kHz and
2-4kHz, or 2-4 kHz and 4-8 kHz).
Can also mix in sounds from the same relative position in other octaves.

Seemed to have best results with mostly evenly-spread frequency bands.
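In rough C, the reconstruction stage of that experiment might look
something like the following (band analysis via FFT or filter bank is
assumed done elsewhere, the adjacent-band rejection is omitted, and the
names are made up):

  #include <math.h>
  #include <string.h>

  #define NBANDS 32
  #define KEEP    3

  /* Keep the KEEP loudest bands at or above min_band (eg, >= 1kHz),
     then synthesize a sine at each band center plus a quieter copy
     one octave up. */
  static void resynth_frame(const float mag[NBANDS],
                            const float center_hz[NBANDS],
                            int min_band,
                            float *out, int nsamp, float fs)
  {
      const float TWO_PI = 6.2831853f;
      int keep[KEEP] = { -1, -1, -1 };
      int b, i, j, k;

      memset(out, 0, (size_t)nsamp * sizeof(*out));

      /* insertion-pick the loudest KEEP bands */
      for (b = min_band; b < NBANDS; b++) {
          for (k = 0; k < KEEP; k++) {
              if (keep[k] < 0 || mag[b] > mag[keep[k]]) {
                  for (j = KEEP - 1; j > k; j--)
                      keep[j] = keep[j - 1];
                  keep[k] = b;
                  break;
              }
          }
      }

      for (k = 0; k < KEEP; k++) {
          if (keep[k] < 0)
              continue;
          for (i = 0; i < nsamp; i++) {
              float w = TWO_PI * center_hz[keep[k]] * (float)i / fs;
              out[i] += mag[keep[k]] * (sinf(w) + 0.5f * sinf(2.0f * w));
          }
      }
  }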


...


Thomas Koenig

Nov 9, 2023, 1:50:41 PM
Quadibloc <quad...@servername.invalid> wrote:

> As Mitch Alsup has rightly noted, I want to have my cake and eat it
> too. I want an instruction format that is quick to fetch and decode,
> like a RISC format. I want RISC-like banks of 32 registers, and I
> want the CISC-like addressing modes of the IBM System/360, but with
> 16-bit displacements, not 12-bit displacements.

So, r1 = r2 + r3 + offset.

Three registers is 15 bits plus a 16-bit offset, which gives you
31 bits. You're left with one bit of opcode, one for load and
one for store.

The /360 had 12 bits for three registers plus 12 bits of offset, so
24 bits left eight bits for the opcode (the RX format).

So, if you want to do this kind of thing, why not go for a full 32-bit
offset in a second 32-bit word?

[...]

> The ISA is still tremendously complicated, since I've put room in it for
> a large assortment of instructions of all kinds, but I think it's
> definitely made a significant stride towards sanity.

Have you ever written an assembler for your ISA?

BGB-Alt

Nov 9, 2023, 4:36:18 PM
On 11/9/2023 12:50 PM, Thomas Koenig wrote:
> Quadibloc <quad...@servername.invalid> wrote:
>
>> As Mitch Alsup has rightly noted, I want to have my cake and eat it
>> too. I want an instruction format that is quick to fetch and decode,
>> like a RISC format. I want RISC-like banks of 32 registers, and I
>> want the CISC-like addressing modes of the IBM System/360, but with
>> 16-bit displacements, not 12-bit displacements.
>
> So, r1 = r2 + r3 + offset.
>
> Three registers is 15 bits plus a 16-bit offset, which gives you
> 31 bits. You're left with one bit of opcode, one for load and
> one for store.
>

Oh, that is even worse than I understood it as, namely:
LDx Rd, (Rs, Disp16)
...

But, yeah, 1 bit of opcode clearly wouldn't work...


> The /360 had 12 bits for three registers plus 12 bits of offset, so
> 24 bits left eight bits for the opcode (the RX format).
>
> So, if you want to do this kind of thing, why not go for a full 32-bit
> offset in a second 32-bit word?
>

Originally, I had turned any displacements that didn't fit into 9 bits
into a 2-op sequence:
MOV Imm25s, R0
MOV.x (Rb, R0), Rn

Actually, worse yet, the first form of BJX2 only had 5-bit Load/Store
displacements, but it didn't take long to realize that 5 bits wasn't
really enough (say, when roughly 2/3 of the load and store operations
can't fit in the displacement).


But, now, there are Jumbo-encodings, which can encode a full 33-bit
displacement in a 64-bit encoding. Not everything is perfect though,
mostly because these encodings are bigger and can't be used in a bundle.

But, still "less bad" in this sense than my original 48-bit encodings,
where "for reasons", these couldn't co-exist with bundles in the same
code block.

Despite the loss of 48-bit ops though:
The jumbo encodings give larger displacements (33s vs 24u or 17s);
They reuse the existing 32-bit decoders, rather than needing a dedicated
48-bit decoder.
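
As a sketch of the bit-joining in C (the 24+9 split matches the
"24+ 9 => 33s" arithmetic mentioned later in the thread; the exact
field positions are a guess):

  #include <stdint.h>

  /* Join a 24-bit jumbo-prefix payload with a 9-bit base displacement
     into a sign-extended 33-bit displacement. */
  static int64_t join_disp33(uint32_t jumbo24, uint32_t disp9) {
      uint64_t u = ((uint64_t)(jumbo24 & 0xFFFFFF) << 9) | (disp9 & 0x1FF);
      return ((int64_t)(u << 31)) >> 31;  /* sign bit is bit 32 */
  }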


But, yeah, "use another instruction word" if one needs a larger
displacement, is mostly the option that I would probably recommend.


At first, the 5-bit encodings went away, but later came back as a zombie
of sorts (cases emerged where their existence was still valuable).

But, then it later came down to a tradeoff (with the design of XG2):
Do I expand Disp9u to Disp10u, and keep the XGPR encoding trick of using
the Disp5u encodings to encode a Disp6s case (for a small range of
negative displacements), or expand Disp9u to Disp10s?...

In this case, Disp10s won out by a small margin, as I needed non-trivial
negative displacements at least slightly more often than I needed 8K for
structs and stack frames and similar.


But, for most things, a 16-bit displacement would be a waste...
If I were going to go the route of using a signed 12-bit displacement
(like RISC-V), would probably still keep it scaled though, as 8K/16K is
still more useful than 2K.


Branch displacements are typically still hard-wired as 2 though, partly
as the ISA started out with 16-bit ops, and switching XG2 over to 4-byte
scale would have broken its symmetry with the Baseline ISA.


Though, could pull a cheap trick and repurpose the LSB of branch ops in
XG2, given as-is, it is effectively "Must Be Zero" (all instructions
have a 32-bit alignment in this mode, and branches to an odd address are
not allowed).

So, the idea of a BSR that uses R1 as an alternate Link-Register is
still not (entirely) dead (while at the same time allowing for the
'.text' section to be expanded to 8MB).


There are 64-bit Disp33s and Abs48 branch encodings, but, yeah, they
have costs:
They are 64-bit vs 32-bit, thus, bigger;
Are ignored by the branch predictor, thus, slower;
The Abs48 case is not PC relative
Using it within a program requires a base reloc;
Is generally useful for DLL imports and special cases though (*1).

*1: Its existence is mostly as an alternative in these cases to a more
expensive option:
MOV Addr64, R1
JMP R1
Which needs 128-bits, and is also ignored by the branch predictor.


> [...]
>
>> The ISA is still tremendously complicated, since I've put room in it for
>> a large assortment of instructions of all kinds, but I think it's
>> definitely made a significant stride towards sanity.
>
> Have you ever written an assembler for your ISA?

Yeah, whether someone can write an assembler, or disassembler/emulator,
and not drive themselves insane in the attempt, is possibly a test of
"sanity".

Granted, still not foolproof, as it isn't that bad to write an
assembler/disassembler for x86 either, but trying to decode it in
hardware would be nightmarish.

Best guess I can have would be a "preclassify" stage:
If this is an opcode byte, how long will it be, and will a Mod/RM
follow, ...?
If this is a Mod/RM byte, how many bytes will this add.

Then in theory, one can figure instruction length like:
Fetch OpLen for IP;
Fetch Mod/RM len for IP+OpLen if Mod/RM flag is set;
Add OpLen+ModRmLen.
Add an extra 2/4 bytes if an Immed is present for this opcode.
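
In C, the two-table version of this might look like the following (the
tables would be filled in from the x86 manuals; prefixes and escape
bytes are ignored here to keep the sketch short):

  #include <stdint.h>

  typedef struct {
      uint8_t oplen;      /* opcode bytes for this leading byte */
      uint8_t has_modrm;  /* 1 if a Mod/RM byte follows */
      uint8_t imm_bytes;  /* trailing immediate size, if any */
  } OpInfo;

  /* Stub tables; a real version fills these in offline. */
  static const OpInfo  op_tab[256];
  static const uint8_t modrm_extra[256];  /* extra SIB/disp bytes */

  static int insn_length(const uint8_t *ip) {
      OpInfo oi  = op_tab[ip[0]];
      int    len = oi.oplen;
      if (oi.has_modrm)
          len += 1 + modrm_extra[ip[oi.oplen]];
      return len + oi.imm_bytes;
  }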

Nicer to not bother.


For my 75 MHz experiment, did end up adding a similar sort of
"preclassify" logic to deal with instruction-lengths though, at the cost
that now L1 I$ cache-lines are specific to the operating mode in which
they were fetched (which now needs to be checked along with the address
and similar).

Mostly all this is a case of "looking up 4 bits of tag metadata" having
lower latency than "feeding 9 bits of instruction through some LUTs"
(or 12 bits if RISC-V decoding is enabled). There is still some latency
due to MUX'ing and similar, but this part is unavoidable.

So, former case:
8 bits: Classify BJX2 instruction length;
1 bit: Specify Baseline or XG2.
Latter case:
8 bits: Classify BJX2 instruction length;
2 bits: Classify RISC-V instruction length (16/32)
2 bits: Specify Baseline, XG2, RISC-V, or XG2RV.

Which map to 4 bits (IIRC):
(0): 16-bit
(1): (WEX && WxE) || Jumbo
(2): WEX
(3): Jumbo


As-is, after MUX'ing, this can effectively turn op-len determination
into a 4 or 6 bit lookup, say (tag bits 1:0 for two adjacent 32-bit words):
00zz: 32-bit
01zz: 16-bit
1000: 64-bit
1001: 48-bit (unused)
1010: 96-bit (*)
1011: Invalid
11zz: Invalid

*: Here, we just assume that the 3rd instruction word is 00.
Would actually need to check this if either 4-wide bundles or 80-bit
encodings were "actually a thing".

Where, handling both XG2 and WXE (WEX Enable) in the preclassify step
greatly simplifies the logic during instruction fetch.

This could, in premise, be reduced further in an "XG2 only" core, or to
a lesser extent by eliminating the original XGPR scheme. These are not
currently planned though (say, the first-stage lookup width could be
reduced from 8 to 5 or 7 bits).

...


Quadibloc

Nov 9, 2023, 4:38:35 PM
On Thu, 09 Nov 2023 18:50:37 +0000, Thomas Koenig wrote:

> So, r1 = r2 + r3 + offset.
>
> Three registers is 15 bits plus a 16-bit offset, which gives you 31
> bits. You're left with one bit of opcode, one for load and one for
> store.

Yes, and obviously that isn't enough. So I do have to make some
compromises.

The offset is 16 bits, because the 68000 (and the 8086, and others) had 16
bit offsets!

But the base and index registers are each specified by only 3 bits - only
the destination register gets a 5-bit field.

I need 5 bits for the opcode. That lets me have load and store for four
floating-point types, load, store, unsigned load, and insert for four
integer types (the largest one only uses load and store).

So it is doable! 5 plus 5 plus 3 plus 3 equals 16, so I have 16 bits left
for the offset.

But that leaves only 1/4 of the opcode space. Which would be fine for a
conventional RISC design, as that's plenty for the operate instructions.
But I needed to reserve _half_ the opcode space, because I needed another
1/4 of the opcode space for putting two 16-bit instructions in a 32-bit
word for more compact code.

That led me to look for compromises... and I found some that would not
overly impair the effectiveness of the memory reference instructions,
which I discussed previously. I ended up using _both_ of two alternatives
each of which alone would have given me the needed savings in opcode
space... that way, the compromised memory-reference instructions could be
accompanied by another complete set of memory-reference instructions with
_no_ compromise... except for only being able to specify aligned operands.

> The /360 had 12 bits for three registers plus 12 bits of offset, so 24
> bits left eight bits for the opcode (the RX format).

Oh, yes, I remember it well.

> So, if you want to do this kind of thing, why not go for a full 32-bit
> offset in a second 32-bit word?

Because the 360 only took 32 bits for a memory-reference instruction, so
using 32 bits for one is sinfully wasteful!

I want to "have my cake and eat it too" - to have a computer that's just
as good as a Power PC or a 68000 or a System/360, even though they have
different, incompatible, strengths that conflict with a computer being
able to be good at what each of them is good at simultaneously.

John Savard

Quadibloc

Nov 9, 2023, 4:42:48 PM
On Thu, 09 Nov 2023 21:38:31 +0000, Quadibloc wrote:

> I want to "have my cake and eat it too" - to have a computer that's just
> as good as a Power PC or a 68000 or a System/360, even though they have
> different, incompatible, strengths that conflict with a computer being
> able to be good at what each of them is good at simultaneously.

Actually, it's worse than that, since I also want the virtues of processors
like the TMS320C2000 or the Itanium.

John Savard

Quadibloc

Nov 9, 2023, 4:51:36 PM
On Thu, 09 Nov 2023 15:36:12 -0600, BGB-Alt wrote:
> On 11/9/2023 12:50 PM, Thomas Koenig wrote:

>> So, r1 = r2 + r3 + offset.
>>
>> Three registers is 15 bits plus a 16-bit offset, which gives you 31
>> bits. You're left with one bit of opcode, one for load and one for
>> store.
>>
>>
> Oh, that is even worse than I understood it as, namely:
> LDx Rd, (Rs, Disp16)
> ...
>
> But, yeah, 1 bit of opcode clearly wouldn't work...

And indeed, he is correct, that is what I'm trying to do.

But I easily solve _most_ of the problem.

I just use 3 bits for the index register and the base register.

The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.

16-bit register-to-register instructions use eight bits to specify their
source and destination registers, so both registers must be from the same
group of eight registers.

This lends itself to writing code where four distinct threads are
interleaved, helping pipelining in implementations too cheap to have
out-of-order execution.

The index register can be one of registers 1 to 7 (0 means no indexing).

The base register can be one of registers 25 to 31. (24, or a 0 in the
three-bit base register field, indicates a special addressing mode.)

This is sort of reminiscent of System/360 coding conventions.
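
In C, that field mapping works out to something like (register numbers
per the description above):

  /* 3-bit index field: 0 means no indexing, else R1..R7. */
  static int index_reg(unsigned f3) { return f3 ? (int)f3 : -1; }

  /* 3-bit base field: 0 selects a special addressing mode, else
     R25..R31. */
  static int base_reg(unsigned f3) { return f3 ? (int)(24 + f3) : -1; }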

The special addressing modes do stuff like using registers 17 to 23 as
base registers with a 12 bit displacement, so that additional short
segments can be accessed.

As I noted, shaving off two bits each from two fields gives me four more
bits, and five bits is exactly what I need for the opcode field.

Unfortunately, I needed one more bit, because I also wanted 16-bit
instructions, and they take up too much space. That led me... to some
interesting gyrations, but I finally found a compromise that was
acceptable to me for saving those bits, so acceptable that I could drop
the option of using the block header to switch to using "full" instructions
instead. Finally!

John Savard

Quadibloc

Nov 9, 2023, 5:11:46 PM
And don't forget the Cray-I.

So the idea is to have *one* ISA that will serve for...

embedded microcontrollers,
data-base servers,
desktop workstations, and
HPC supercomputers.

Of course, these different tasks will require different implementations,
which focus on doing parts of the ISA well.

John Savard

BGB-Alt

Nov 9, 2023, 6:49:09 PM
On 11/9/2023 3:51 PM, Quadibloc wrote:
> On Thu, 09 Nov 2023 15:36:12 -0600, BGB-Alt wrote:
>> On 11/9/2023 12:50 PM, Thomas Koenig wrote:
>
>>> So, r1 = r2 + r3 + offset.
>>>
>>> Three registers is 15 bits plus a 16-bit offset, which gives you 31
>>> bits. You're left with one bit of opcode, one for load and one for
>>> store.
>>>
>>>
>> Oh, that is even worse than I understood it as, namely:
>> LDx Rd, (Rs, Disp16)
>> ...
>>
>> But, yeah, 1 bit of opcode clearly wouldn't work...
>
> And indeed, he is correct, that is what I'm trying to do.
>
> But I easily solve _most_ of the problem.
>
> I just use 3 bits for the index register and the base register.
>
> The 32 general registers aren't _quite_ general. They're divided into
> four groups of eight.
>

Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.

Unless, maybe, registers were being treated like a stack, but even then,
this is still gonna suck.

Much preferable for a compiler to have a flat space of 32 or 64
registers. Having 16 sorta works, but does still add a bit to spill and
fill.


Theoretically, 32 registers should be "pretty good", but I ended up with
64 partly due to arguable weakness in my compiler's register allocation.

Say, 64 makes it possible to static-assign most of the variables in most
of the functions, which avoids the need for spill and fill (at least
with a register allocator that isn't smart enough to locally assign
registers across basic-block boundaries).

I am not sure if a more clever compiler (such as GCC) could also find
ways to make effective use of 64 GPRs.


I guess, IA-64 did have 128 registers in banks of 32. Not sure how well
this worked.


> 16-bit register-to-register instructions use eight bits to specify their
> source and destination registers, so both registers must be from the same
> group of eight registers.
>

When I added R32..R63, I ended up not bothering adding any way to access
them from 16-bit ops.

So:
R0..R15: Generally accessible for all of 16-bit land;
R16..R31: Accessible from a limited subset of 16-bit operations.
R32..R63: Inaccessible from 16-bit land.
Only accessible for an ISA subset for 32-bit ops in XGPR.

Things are more orthogonal in XG2:
No 16-bit ops;
All of the 32-bit ops can access R0..R63 in the same way.


> This lends itself to writing code where four distinct threads are
> interleaved, helping pipelining in implementations too cheap to have
> out-of-order executiion.
>

Considered variations on this in my case as well, just with static
control flow.

However, BGBCC is nowhere near clever enough to pull this off...

Best that can be managed is doing this sort of thing manually (this is
sort of how "functions with 100+ local variables" are born).

In theory, a compiler could infer when blocks of code or functions are
not sequentially dependent and inline everything and schedule it in
parallel, but alas, this sort of thing requires a bit of cleverness that
is hard to pull off.


> The index register can be one of registers 1 to 7 (0 means no indexing).
>
> The base register can be one of registers 25 to 31. (24, or a 0 in the
> three-bit base register field, indicates a special addressing mode.)
>
> This sort of is reminiscent of System/360 coding conventions.
>

OK.


> The special addressing modes do stuff like using registers 17 to 23 as
> base registers with a 12 bit displacement, so that additional short
> segments can be accessed.
>
> As I noted, shaving off two bits each from two fields gives me four more
> bits, and five bits is exactly what I need for the opcode field.
>
> Unfortunately, I needed one more bit, because I also wanted 16-bit
> instructions, and they take up too much space. That led me... to some
> interesting gyrations, but I finally found a compromise that was
> acceptable to me for saving those bits, so acceptable that I could drop
> the option of using the block header to switch to using "full" instructions
> instead. Finally!
>

A more straightforward encoding would make things, well, more straightforward...


Main debates I think are, say:
Whether to start with the MSB of each word (what I had often done);
Or, start from the LSB (like RISC-V);
Whether 5 or 6 bit register fields;
How much bits for immediate and opcode fields;
...

Bundling and predication may eat a few bits, say:
00: Scalar
01: Bundle
10/11: If-True / If-False

In my case, this did leave an ugly hack case to support conditional ops
in bundles. Namely, the instruction to "Load 24 bits into R0" has
different interpretations in each case (Scalar: Load 24 bits into R0;
Bundle: Jumbo Prefix; If-True/If-False, repeat a different instruction
block, but understood as both conditional and bundled).

This could be fully orthogonal with 3 bits, but it seems, this is a big ask:
000, Unconditional, Scalar
001, Unconditional, Bundle
010, Special, Scalar (Eg: Large constant load or Branch)
011, Special, Bundle (Eg: Jumbo Prefix)
100, If-True, Scalar
101, If-True, Bundle
110, If-False, Scalar
111, If-False, Bundle


This leads to a lopsided encoding though, and it seems like things only
really fit together nicely with a limited combination of sizes.

Say, for an immediate field:
24+ 9 => 33s
24+24+16 => 64
This is almost magic...

Though:
26+ 7 => 33s
26+26+12 => 64
Could also work.


But, does end up with an ISA layout where immediate values are mostly 7u
or 7n, which is not nearly as attractive as 9u and 9n.

Say, for Load/Store displacement hit rates (rough approximations, from memory):
5u: 35%
7u: 65%
9u: 90%
...


All turns into a bit of an annoying numbers game sometimes...


But, this ended up as part of why I ended up with XG2, which didn't give
me everything I wanted, and the encodings of some things do have more
"dog chew" than I would like (I would have preferred everything as nice
contiguous fields, rather than the bits for each register field being
scattered across the instruction word).

But, the numbers added up in a way that worked better than most of the
alternatives I could come up with (and happened to also be the "least
effort" implementation path).


Granted, I still keep half expecting people to be like "Dude, just jump
onto the RISC-V wagon...".

Or, failing this, at least implement enough of RISC-V to be able to run
Linux on it (but, this would require significant architectural changes;
being able to run a "stock" RV64GC Linux build would effectively require
partially cloning a bunch of SiFive's architectural choices or similar;
which is not something I would be happy with).

But, otherwise, pretty much any other option in this area would still
mean a porting effort...


Well, and the on/off consideration of trying to port a BSD variant, as
BSD seemed like potentially less effort (there are far fewer implicit
assumptions of GNU-related stuff being used).

...

John Dallman

Nov 9, 2023, 7:29:24 PM
In article <uijjoj$2dc2i$1...@dont-email.me>, quad...@servername.invalid
(Quadibloc) wrote:

> Actually, it's worse than that, since I also want the virtues of
> processors like the TMS320C2000 or the Itanium.

What do you consider the virtues of Itanium to be?

No company ever seems to have taken it up on technical grounds, only as a
result of Intel and HP persuading commercial managers that it would
become widely used owing to their market power.

John

MitchAlsup

Nov 9, 2023, 8:11:23 PM
Quadibloc wrote:

> Some progress has been made in advancing a small step towards sanity
> in the description of the Concertina II architecture described at

> http://www.quadibloc.com/arch/ct17int.htm

> As Mitch Alsup has rightly noted, I want to have my cake and eat it
> too. I want an instruction format that is quick to fetch and decode,
> like a RISC format. I want RISC-like banks of 32 registers, and I
> want the CISC-like addressing modes of the IBM System/360, but with
> 16-bit displacements, not 12-bit displacements.
<
My 66000 has all of this.
<
> I want memory-reference instructions to still fit in 32 bits, despite
> asking for so much more capacity.
<
The simple/easy ones definitely, the ones with longer displacements no.
<
> So what I had done was, after squeezing as much as I could into a basic
> instruction format, I provided for switching into alternate instruction
> formats which made different compromises by using the block headers.
<
Block headers are simply consuming entropy.
<
> This has now been dropped. Since I managed to get the normal (unaligned)
> memory-reference instruction squeezed into so much less opcode space that
> I also had room for the aligned memory-reference format without compromises
> in the basic instruction set, it wasn't needed to have multiple instruction
> formats.
<
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
<
> I had to change the instructions longer than 32 bits to get them in the
> basic instruction format, so now they're less dense.

> Block structure is still used, but now for only the two things it's
> actually needed for: reserving part of a block as unused for the
> pseudo-immediates, and for VLIW features (explicitly indicating
> parallelism, and instruction predication).

> The ISA is still tremendously complicated, since I've put room in it for
> a large assortment of instructions of all kinds, but I think it's
> definitely made a significant stride towards sanity.
<
Yet, mine remains simple and compact.
<
> John Savard

BGB

Nov 9, 2023, 11:22:14 PM
On 11/9/2023 7:11 PM, MitchAlsup wrote:
> Quadibloc wrote:
>

Good to see you are back on here...


>> Some progress has been made in advancing a small step towards sanity
>> in the description of the Concertina II architecture described at
>
>> http://www.quadibloc.com/arch/ct17int.htm
>
>> As Mitch Alsup has rightly noted, I want to have my cake and eat it
>> too. I want an instruction format that is quick to fetch and decode,
>> like a RISC format. I want RISC-like banks of 32 registers, and I
>> want the CISC-like addressing modes of the IBM System/360, but with
>> 16-bit displacements, not 12-bit displacements.
> <
> My 66000 has all of this.
> <
>> I want memory-reference instructions to still fit in 32 bits, despite
>> asking for so much more capacity.
> <
> The simple/easy ones definitely, the ones with longer displacements no.
> <

Yes.

As noted a few times, as I see it, 9 .. 12 is sufficient.
Much less than 9 is "not enough", much more than 12 is wasting entropy,
at least for 32-bit encodings.


12u-scaled would be "pretty good", say, being able to handle 32K for
QWORD ops.


>> So what I had done was, after squeezing as much as I could into a basic
>> instruction format, I provided for switching into alternate instruction
>> formats which made different compromises by using the block headers.
> <
> Block headers are simply consuming entropy.
> <

Also yes.


>> This has now been dropped. Since I managed to get the normal (unaligned)
>> memory-reference instruction squeezed into so much less opcode space that
>> I also had room for the aligned memory-reference format without
>> compromises
>> in the basic instruction set, it wasn't needed to have multiple
>> instruction
>> formats.
> <
> I never had any aligned memory references. The HW overhead to "fix" the
> problem is so small as to be compelling.
> <

In my case, it is only for 128-bit load/store operations, which require
64-bit alignment.

Well, and an esoteric edge case:
if((PC&0xE)==0xE)
You can't use a 96-bit encoding, and will need to insert a NOP if one
needs to do so.


One can argue that aligned-only allows for a cheaper L1 D$, but also
"sucks pretty bad" for some tasks:
Fast memcpy;
LZ decompression;
Huffman;
...


>> I had to change the instructions longer than 32 bits to get them in
>> the basic instruction format, so now they're less dense.
>
>> Block structure is still used, but now for only the two things it's
>> actually needed for: reserving part of a block as unused for the
>> pseudo-immediates, and for VLIW features (explicitly indicating
>> parallelism, and instruction predication).
>
>> The ISA is still tremendously complicated, since I've put room in it for
>> a large assortment of instructions of all kinds, but I think it's
>> definitely made a significant stride towards sanity.
> <
> Yet, mine remains simple and compact.
> <

Mostly similar.
Though, I guess some people could debate this in my case.


Granted, I specify the entire ISA in a single location, rather than
spreading it across a bunch of different documents (as was the case with
RISC-V).

Well, and where there is a lot that is left up to the specific hardware
implementations in terms of stuff that one would need to "actually have
an OS run on it", ...


>> John Savard

Quadibloc

Nov 9, 2023, 11:31:49 PM
On Fri, 10 Nov 2023 00:29:00 +0000, John Dallman wrote:

> In article <uijjoj$2dc2i$1...@dont-email.me>, quad...@servername.invalid
> (Quadibloc) wrote:
>
>> Actually, it's worse than that, since I also want the virtues of
>> processors like the TMS320C2000 or the Itanium.
>
> What do you consider the virtues of Itanium to be?

Well, I think that superscalar operation of microprocessors is a good
thing. Explicitly indicating which instructions may execute in parallel
is one way to facilitate that. Even if the Itanium was an unsuccessful
implementation of that principle.

John Savard

Quadibloc

Nov 9, 2023, 11:37:21 PM
On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
> On 11/9/2023 3:51 PM, Quadibloc wrote:

>> The 32 general registers aren't _quite_ general. They're divided into
>> four groups of eight.

> Errm, splitting up registers like this is likely to hurt far more than
> anything that 16-bit displacements are likely to gain.

For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.

It's only in the 16-bit operate instructions that this splitting of
registers is actively present as a constraint. It is needed to make
16-bit operate instructions possible.

So the cure is that if a compiler finds this too much trouble, it
doesn't have to use the 16-bit instructions.

Of course, if compilers can't use them, that raises the question of
whether 16-bit instructions are worth having. Without them, the
complications that I needed to be happy about my memory-reference
instructions could have been entirely avoided.

John Savard

Quadibloc

Nov 9, 2023, 11:43:19 PM
On Fri, 10 Nov 2023 01:11:13 +0000, MitchAlsup wrote:

> I never had any aligned memory references. The HW overhead to "fix" the
> problem is so small as to be compelling.

Since I have a complete set of memory-reference instructions for which
unaligned accesses are supported, the problem isn't that I think
unaligned fetches and stores take too many gates.

Rather, being able to only specify aligned accesses saves *opcode space*,
which lets me fit in one complete set of memory-reference instructions that
can use all the base registers, all the index registers, and always use all
the registers as destination registers.

While the unaligned-capable instructions, which also offer important
additional addressing modes, had to have certain restrictions to fit in.

So they use six out of the seven index registers, they can use only half
the registers as destination registers on indexed accesses, and they use
four out of the seven base registers.

Having 16-bit instructions for the possibility of more compact code meant
that I had to have at least one of the two restrictions noted above -
having both restrictions meant that I could offer the alternative of
aligned-only instructions with neither restriction, which may be far less
painful for some.

John Savard

BGB

Nov 10, 2023, 1:49:11 AM
On 11/9/2023 10:37 PM, Quadibloc wrote:
> On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
>> On 11/9/2023 3:51 PM, Quadibloc wrote:
>
>>> The 32 general registers aren't _quite_ general. They're divided into
>>> four groups of eight.
>
>> Errm, splitting up registers like this is likely to hurt far more than
>> anything that 16-bit displacements are likely to gain.
>
> For 32-bit instructions, the only implication is that the first few
> integer registers would be used as index registers, and the last few
> would be used as base registers, which is likely to be true in any
> case.
>
> It's only in the 16-bit operate instructions that this splitting of
> registers is actively present as a constraint. It is needed to make
> 16-bit operate instructions possible.
>

FWIW: I went with 16-bit ops with 4-bit register fields (with a small
subset with 5-bit register fields).

Granted, layout was different than SH:
zzzz-nnnn-mmmm-zzzz //typical SH layout
zzzz-zzzz-nnnn-mmmm //typical BJX2 layout

Where, as noted, typical 32-bit layout in my case is:
111p-ZwZZ-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ
And, in XG2:
NMOP-ZwZZ-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ




I guess, a "minor" reorganization might yield, say:
PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ (3R)
PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmZZ-ZZZZ-ZZZZ (2R)
PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (3RI, Imm10)
PwZZ-ZZZZ-ZZnn-nnnn-ZZZZ-ZZii-iiii-iiii (2RI, Imm10)
PwZZ-ZZZZ-ZZnn-nnnn-iiii-iiii-iiii-iiii (2RI, Imm16)
PwZZ-ZZZZ-iiii-iiii-iiii-iiii-iiii-iiii (Imm24)

Which seems like actually a relatively nice layout thus far...


Possibly, going further:
Pw00-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ (3R Space)
Pw00-1111-ZZnn-nnnn-mmmm-mmZZ-ZZZZ-ZZZZ (2R Space)

Pw01-ZZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (Ld/St Disp10)

Pw10-0ZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (3RI Imm10, ALU Block)
Pw10-1ZZZ-ZZnn-nnnn-ZZZZ-ZZii-iiii-iiii (2RI Imm10)

Pw11-0ZZZ-ZZnn-nnnn-iiii-iiii-iiii-iiii (2RI, Imm16)

Pw11-1110-iiii-iiii-iiii-iiii-iiii-iiii BRA Disp24s (+/- 32MB)
Pw11-1111-iiii-iiii-iiii-iiii-iiii-iiii BSR Disp24s (+/- 32MB)

1111-111Z-iiii-iiii-iiii-iiii-iiii-iiii Jumbo


Though, might almost make sense for PrWEX to be N/E, as the PrWEX blocks
seem to be infrequently used in BJX2 (basically, for predicated
instructions that exist as part of an instruction bundle).

Say:
Scalar: 77.3%
WEX : 8.9%
Pred : 13.5%
PrWEX : 0.3%


> So the cure is that if a compiler finds this too much trouble, it
> doesn't have to use the 16-bit instructions.
>
> Of course, if compilers can't use them, that raises the question of
> whether 16-bit instructions are worth having. Without them, the
> complications that I needed to be happy about my memory-reference
> instructions could have been entirely avoided.
>

For performance optimized cases, I am starting to suspect 16-bit ops are
not worth it.

For size optimization, they make sense; but size optimization also means
mostly confining register allocation to R0..R15 in my case, with
heuristics for when to enable additional registers, where enabling the
higher registers effectively hinders the use of 16-bit instructions.


The other option I have found is that, rather than optimizing for
smaller instructions (as in an ISA with 16 bit instructions), one can
instead optimize for doing stuff in as few instructions as it is
reasonable to do so, which in turn further goes against the use of
16-bit instructions.


And, thus far, I am ending up building a lot of my programs in XG2 mode
despite the slightly worse code density (leaving the main "hold outs"
for the Baseline encoding mostly being the kernel and Boot ROM).

The kernel could go over to XG2 without too much issue, mostly leaving
the Boot ROM. Switching over the ROM would require some functional
tweaks (coming out of reset in a different mode), as well as probably
either increasing the size of the ROM or removing some stuff (building
the Boot ROM as-is in XG2 mode would exceed the current 32K limit).


Granted, the main things the ROM contains are a bunch of boot-time
sanity-check stuff, a RAM counter, a FAT32 driver, and stuff to init the
graphics module (such as a boot-time ASCII font, *).

*: Though, this font saves some space by only encoding the ASCII-range
characters, and packing the character glyphs into 5*6 pixels (allowing
32-bits, rather than the 64-bits needed for an 8x8 glyph). This won out
aesthetically over using a 7-segment or 14-segment font (as well as it
taking more complex logic to unpack 7 or 14 segment into an 8x8
character cell).
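
A sketch of unpacking such a glyph, assuming row-major packing at 5
bits per row (the actual bit order is not stated here):

  #include <stdint.h>

  /* Expand a 5x6 glyph packed into 30 of 32 bits to an 8x8 cell,
     one byte per row, one bit per pixel. */
  static void unpack_glyph(uint32_t g, uint8_t cell[8]) {
      int row;
      for (row = 0; row < 8; row++)
          cell[row] = 0;
      for (row = 0; row < 6; row++) {
          uint8_t bits = (uint8_t)((g >> (row * 5)) & 0x1F);
          cell[row + 1] = (uint8_t)(bits << 2);  /* roughly centered */
      }
  }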

Where, say, unlike a CGA or VGA, the initial font is not held in a
hardware ROM. There was originally, but it was cheaper to manage the
font in software, effectively using the VRAM as a plain color-cell
display in text mode.

...

Scott Lurndal

Nov 10, 2023, 9:51:49 AM
Quadibloc <quad...@servername.invalid> writes:
>On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
>> On 11/9/2023 3:51 PM, Quadibloc wrote:
>
>>> The 32 general registers aren't _quite_ general. They're divided into
>>> four groups of eight.
>
>> Errm, splitting up registers like this is likely to hurt far more than
>> anything that 16-bit displacements are likely to gain.
>
>For 32-bit instructions, the only implication is that the first few
>integer registers would be used as index registers, and the last few
>would be used as base registers, which is likely to be true in any
>case.

As soon as you make 'general purpose registers' not 'general'
you've significantly complicated register allocation in compilers
and likely caused additional memory accesses due to the need to
spill registers unnecessarily.

MitchAlsup

Nov 10, 2023, 1:26:22 PM
BGB wrote:

> On 11/9/2023 7:11 PM, MitchAlsup wrote:
>> Quadibloc wrote:
>>

> Good to see you are back on here...


>>> Some progress has been made in advancing a small step towards sanity
>>> in the description of the Concertina II architecture described at
>>
>>> http://www.quadibloc.com/arch/ct17int.htm
>>
>>> As Mitch Alsup has rightly noted, I want to have my cake and eat it
>>> too. I want an instruction format that is quick to fetch and decode,
>>> like a RISC format. I want RISC-like banks of 32 registers, and I
>>> want the CISC-like addressing modes of the IBM System/360, but with
>>> 16-bit displacements, not 12-bit displacements.
>> <
>> My 66000 has all of this.
>> <
>>> I want memory-reference instructions to still fit in 32 bits, despite
>>> asking for so much more capacity.
>> <
>> The simple/easy ones definitely, the ones with longer displacements no.
>> <

> Yes.

> As noted a few times, as I see it, 9 .. 12 is sufficient.
> Much less than 9 is "not enough", much more than 12 is wasting entropy,
> at least for 32-bit encodings.
<
Can you suggest something I could have done by sacrificing 16-bits
down to 12-bits that would have improved "something" in my ISA ??
{{You see I did not have any trouble in having all 16-bits for MEM
references--just like having 16-bits for integer, logical, and branch
offsets.}}
<
> 12u-scaled would be "pretty good", say, being able to handle 32K for
> QWORD ops.
<
IBM 360 found so, EMBench is replete with stack sizes and struct sizes
where My 66000 uses 1×32-bit instruction where RISC-V needs 2×32-bit...
Exactly the difference between 12-bits and 14-bits....

>>> So what I had done was, after squeezing as much as I could into a basic
>>> instruction format, I provided for switching into alternate instruction
>>> formats which made different compromises by using the block headers.
>> <
>> Block headers are simply consuming entropy.
>> <

> Also yes.


>>> This has now been dropped. Since I managed to get the normal (unaligned)
>>> memory-reference instruction squeezed into so much less opcode space that
>>> I also had room for the aligned memory-reference format without
>>> compromises
>>> in the basic instruction set, it wasn't needed to have multiple
>>> instruction
>>> formats.
>> <
>> I never had any aligned memory references. The HW overhead to "fix" the
>> problem is so small as to be compelling.
>> <

> In my case, it is only for 128-bit load/store operations, which require
> 64-bit alignment.
<
VVM does all the wide stuff without necessitating the wide stuff in
registers or instructions.
<
> Well, and an esoteric edge case:
> if((PC&0xE)==0xE)
> You can't use a 96-bit encoding, and will need to insert a NOP if one
> needs to do so.
<
Ehhhhh...
<
> One can argue that aligned-only allows for a cheaper L1 D$, but also
> "sucks pretty bad" for some tasks:
> Fast memcpy;
> LZ decompression;
> Huffman;
> ...
<
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}

BGB

Nov 10, 2023, 1:26:33 PM
Yeah.

Either banks of 8, or an 8 data + 8 address, or ... would kinda "rather
suck".

Or, even smaller cases, like, "most instructions can use all the
registers, but these ops only work on a subset" is kind of an annoyance
(this is a big part of why I bothered with the whole XG2 thing).


Much better to have a big flat register space.


Though, within reason.
Say:
* 8: Pain, can barely hold anything in registers.
** One barely has enough for working values for expressions, etc.
* 16: Not quite enough, still lots of spill/fill.
* 32: Can work well, with a good register allocator;
* 64: Can largely eliminate spill/fill, but a little much.
* 128: Too many.
* 256: Absurd.

So, say, 32 and 64 seem to be the "good" area, where with 32, a majority
of the functions can sit comfortably with most or all of their variables
held in registers. But, for functions with a large number of variables
(say, 100 or more), spill/fill becomes an issue (*).

Having 64 allows a majority of functions to use a "static assign
everything" strategy, where spill/fill can be eliminated entirely (apart
from the prolog/epilog sequences), and otherwise seems to deal better
with functions with large numbers of variables.


*: And is more of a pain with a register allocator design which can't
keep any non-static-assigned values in registers across basic-block
boundaries. This issue is, ironically, less obvious with 16 registers
(since spill/fill runs rampant anyways). But having nearly every basic
block start with a blob of stack loads, and end with a blob of stores,
only to reload them all again on the other side of a label, is fairly
obvious.

Having 64 registers does at least mostly hit this nail...


Meanwhile, for 128, there aren't really enough variables and temporaries
in most functions to make effective use of them. Also, 7-bit register
fields won't fit easily into a 32-bit instruction word.


As for register arguments:
* Probably 8 or 16.
** 8 makes the most sense with 32 GPRs.
*** 16 is asking too much.
*** 8 deals with around 98% of functions.
** 16 makes sense with 64 GPRs.
*** Nearly all functions can use exclusively register arguments.
*** Gain is small though, if it only benefits 2% of functions.
*** It is almost a "shoo-in", except for the cost of fixed spill space
*** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
*** Though, an ABI could decide to not have a spill space in this way.

Though, admittedly, for a lot of my programs I had still ended up going
with 8 register arguments with 64 GPRs, mostly as the gain from 16
arguments is small relative to the cost of spending an additional 64
bytes in nearly every stack frame (and also there are still some
unresolved bugs when using 16-argument mode).

...



Current leaning is also that:
32-bit primary instruction size;
32/64/96 bit for variable-length instructions;
Is "pretty good".

In performance-oriented use cases, 16-bit encodings "aren't really worth
it".
In cases where you need a 32- or 64-bit value, being able to encode it
or load it quickly into a register is ideal. Spending multiple
instructions to glue a value together isn't ideal, nor is needing to
load it from memory (this particularly sucks from the compiler POV).


As for addressing modes:
(Rb, Disp) : ~ 66-75%
(Rb, Ri) : ~ 25-33%
Can address the vast majority of cases.

Displacements are most effective when scaled by the size of the element
type, as unaligned displacements are exceedingly rare. The vast majority
of displacements are also positive.

Not having a register-indexed mode is shooting oneself in the foot, as
these are "not exactly rare".

Most other possible addressing modes can be mostly ignored.
Auto-increment becomes moot if one has superscalar or VLIW;
(Rb, Ri, Disp) is only really applicable in niche cases
Eg, array inside struct, etc. (see the sketch just below)
...
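
As a quick sketch of that "array inside struct" case (made-up names,
just for illustration): with an (Rb, Ri, Disp) mode the access below
folds into a single effective address, Rb=s, Ri=i (scaled by 8), and
Disp=offsetof(struct obj, tbl).

  #include <stdint.h>

  struct obj { uint64_t hdr; uint64_t tbl[16]; };

  uint64_t get(struct obj *s, int i)
  {
      /* EA = s + i*8 + 8; without (Rb, Ri, Disp) this takes an extra ADD */
      return s->tbl[i];
  }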



RISC-V did sort of shoot itself in the foot in several of these areas,
albeit with some workarounds in "Bitmanip":
SHnADD can mimic a LEA, allowing array access in fewer ops.
PACK allows an inline 64-bit constant load in 5 instructions (a C model
of PACK follows below)...
LUI+ADD+LUI+ADD+PACK
...
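
As a C model of what that sequence computes (my sketch of the PACK
semantics, not RISC-V reference code; PACK concatenates the low 32-bit
halves of its two sources, with rs2 supplying the upper half):

  #include <stdint.h>

  /* RV64 PACK (Zbkb): rd = rs2[31:0] ## rs1[31:0] */
  uint64_t pack64(uint64_t rs1, uint64_t rs2)
  {
      return ((rs2 & 0xFFFFFFFFu) << 32) | (rs1 & 0xFFFFFFFFu);
  }

LUI+ADD builds each 32-bit half (any sign-extension into the upper bits
is harmless, since PACK discards it), and PACK glues the halves together.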

Still not ideal...

An extra cycle for memory access is not ideal for a close-second-place
addressing mode; nor are 64-bit constants rare enough that one
necessarily wants to spend 5 or so clock cycles on them.

But, still better than the situation where one does not have these
instructions.

...

MitchAlsup

unread,
Nov 10, 2023, 1:31:38 PM11/10/23
to
Quadibloc wrote:

> On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
>> On 11/9/2023 3:51 PM, Quadibloc wrote:

>>> The 32 general registers aren't _quite_ general. They're divided into
>>> four groups of eight.

>> Errm, splitting up registers like this is likely to hurt far more than
>> anything that 16-bit displacements are likely to gain.

> For 32-bit instructions, the only implication is that the first few
> integer registers would be used as index registers, and the last few
> would be used as base registers, which is likely to be true in any
> case.

> It's only in the 16-bit operate instructions that this splitting of
> registers is actively present as a constraint. It is needed to make
> 16-bit operate instructions possible.

> So the cure is that if a compiler finds this too much trouble, it
> doesn't have to use the 16-bit instructions.
<
Then why are they there ??
<
I think you will find (as RISC-V has) that having but not mandating use
means you get a bit under ½ of what you think you are getting.
<
> Of course, if compilers can't use them, that raises the question of
> whether 16-bit instructions are worth having. Without them, the
> complications that I needed to be happy about my memory-reference
> instructions could have been entirely avoided.
<
There is a subset of RISC-V designers who want to discard the 16-bit
subset in order to solve the problems of the 32-bit set.
<
I might note: given the space of the compressed ISA in RISC-V, I could
install the entire My 66000 ISA and then not need any of the RISC-V
ISA.....
<
> John Savard

MitchAlsup

unread,
Nov 10, 2023, 1:31:38 PM11/10/23
to
Quadibloc wrote:

> On Fri, 10 Nov 2023 00:29:00 +0000, John Dallman wrote:

>> In article <uijjoj$2dc2i$1...@dont-email.me>, quad...@servername.invalid
>> (Quadibloc) wrote:
>>
>>> Actually, it's worse than that, since I also want the virtues of
>>> processors like the TMS320C2000 or the Itanium.
>>
>> What do you consider the virtues of Itanium to be?

Itanic's main virtue was to consume several Intel design teams, over 20
years, preventing Intel from taking over the entire µprocessor market.

I, personally, don't believe in exposing the scalarity to the compiler,
nor the rotating register file to do what renaming does naturally,
nor the lack of proper FP instructions (FDIV, SQRT), ...

Academic quality at industrial prices.

BGB

unread,
Nov 10, 2023, 1:50:38 PM11/10/23
to
RISC-V is 12-bit signed unscaled (which can only do +/- 2K).

On average, 12-bit signed unscaled is actually worse than 9-bit unsigned
scaled (4K range, for QWORD).

So, ironically, despite BJX2 having smaller displacements than RISC-V,
it actually deals better with the larger stack frames.


But, if one could address 32K, this should cover the vast majority of
structs and stack-frames.
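
As a quick sketch of the arithmetic behind these reach figures (the
helper is mine, not ISA code):

  /* Max positive byte reach of a load/store displacement field. */
  unsigned long max_reach(unsigned bits, int is_signed, unsigned scale)
  {
      unsigned long field_max = (1UL << (bits - (is_signed ? 1 : 0))) - 1;
      return field_max * scale;
  }
  /* max_reach(12, 1, 1) ==   2047   (RISC-V, 12s unscaled)
     max_reach( 9, 0, 8) ==   4088   (9u scaled, QWORD)
     max_reach(16, 0, 8) == 524280   (~512K, per the next paragraph) */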


A 16-bit unsigned scaled displacement would cover 512K for QWORD ops,
which could be nice, but likely unnecessary.
(As for the earlier ((PC&0xE)==0xE) restriction: this is mostly due to a
quirk in the L1 I$ design, where "fixing" it costs more than just being
like, "yeah, this case isn't allowed", and having the compiler emit a NOP
in the rare edge cases where it is encountered.)


>> One can argue that aligned-only allows for a cheaper L1 D$, but also
>> "sucks pretty bad" for some tasks:
>>    Fast memcpy;
>>    LZ decompression;
>>    Huffman;
>>    ...
> <
> Time found that HW can solve the problem way more than adequately--
> obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
> <
>

Wait, are you arguing for aligned-only memory ops here?...

But, yeah, for me, a major selling point for unaligned access is mostly
that I can copy blocks of memory around like:
  v0=((uint64_t *)cs)[0];
  v1=((uint64_t *)cs)[1];
  v2=((uint64_t *)cs)[2];
  v3=((uint64_t *)cs)[3];
  ((uint64_t *)ct)[0]=v0;
  ((uint64_t *)ct)[1]=v1;
  ((uint64_t *)ct)[2]=v2;
  ((uint64_t *)ct)[3]=v3;
  cs+=32; ct+=32;

For Huffman, some of the fastest strategies for implementing the
bitstream reading/writing tend to casually make use of unaligned access
(shifting in and loading bytes is slower in comparison).

Though, all this falls on its face, if encountering a CPU that uses
traps to emulate unaligned access (apparently a lot of the SiFive cores
and similar).

Stephen Fuld

unread,
Nov 10, 2023, 2:17:43 PM11/10/23
to
On 11/10/2023 10:24 AM, BGB wrote:
> On 11/10/2023 8:51 AM, Scott Lurndal wrote:
>> Quadibloc <quad...@servername.invalid> writes:
>>> On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
>>>> On 11/9/2023 3:51 PM, Quadibloc wrote:
>>>
>>>>> The 32 general registers aren't _quite_ general. They're divided into
>>>>> four groups of eight.
>>>
>>>> Errm, splitting up registers like this is likely to hurt far more than
>>>> anything that 16-bit displacements are likely to gain.
>>>
>>> For 32-bit instructions, the only implication is that the first few
>>> integer registers would be used as index registers, and the last few
>>> would be used as base registers, which is likely to be true in any
>>> case.
>>
>> As soon as you make 'general purpose registers' not 'general'
>> you've significantly complicated register allocation in compilers
>> and likely caused additional memory accesses due to the need to
>> spill registers unnecessarily.
>
> Yeah.
>
> Either banks of 8, or an 8 data + 8 address, or ... would kinda "rather
> suck".
>
> Or, even smaller cases, like, "most instructions can use all the
> registers, but these ops only work on a subset" is kind of an annoyance
> (this is a big part of why I bothered with the whole XG2 thing).
>
>
> Much better to have a big flat register space.

Yes, but sometimes you just need "another bit" in the instructions. So
an alternative is to break the requirement that all register specifier
fields in the instruction be the same length. So, for example, allow
access to all registers from one source operand position, but say only
half from the other source operand position. So, for a system with 32
registers, you would need 5 plus 5 plus 4 bits. Much of the time, such
as with commutative operations like adds, this doesn't hurt at all.

Yes, this makes register allocation in the compiler harder. And
occasionally you might need an extra instruction to copy a value to the
half size field, but on high end systems, this can be done in the rename
stage without taking an execution slot.

A more extreme alternative is to make the destination field one bit
smaller as well. Of course, this makes things even harder for the
compiler, and probably requires extra "copy" instructions more
frequently, but sometimes you just gotta do what you gotta do. :-(
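
To make the bit budget concrete, a little sketch (a hypothetical layout,
not any particular ISA): two full 5-bit source specifiers plus one 4-bit
field that can only name half the registers, 14 bits in all.

  #include <stdint.h>

  /* Pack 5+5+4 register specifiers into the low 14 bits of a word. */
  uint32_t enc_regs(unsigned rs1, unsigned rs2, unsigned rd)
  {
      /* rs1, rs2: 0..31; rd: 0..15 (the "half size" field) */
      return ((rs1 & 31u) << 9) | ((rs2 & 31u) << 4) | (rd & 15u);
  }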

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Thomas Koenig

unread,
Nov 10, 2023, 5:03:28 PM11/10/23
to
Quadibloc <quad...@servername.invalid> schrieb:
> On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:
>> On 11/9/2023 3:51 PM, Quadibloc wrote:
>
>>> The 32 general registers aren't _quite_ general. They're divided into
>>> four groups of eight.
>
>> Errm, splitting up registers like this is likely to hurt far more than
>> anything that 16-bit displacements are likely to gain.
>
> For 32-bit instructions, the only implication is that the first few
> integer registers would be used as index registers, and the last few
> would be used as base registers, which is likely to be true in any
> case.

This breaks with the central tenet of the /360, the PDP-11,
the VAX, and all RISC architectures: (Almost) all registers are
general-purpose registers.

This would make your ISA very un-S/360-like.

MitchAlsup

unread,
Nov 10, 2023, 6:21:20 PM11/10/23
to
BGB wrote:

> On 11/10/2023 12:22 PM, MitchAlsup wrote:
>
>>> One can argue that aligned-only allows for a cheaper L1 D$, but also
>>> "sucks pretty bad" for some tasks:
>>>    Fast memcpy;
>>>    LZ decompression;
>>>    Huffman;
>>>    ...
>> <
>> Time found that HW can solve the problem way more than adequately--
>> obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
>> <
>>

> Wait, are you arguing for aligned-only memory ops here?...
<

No, I am arguing that all memory references are inherently unaligned, but
aligned references never suffer a stall penalty; and the compiler does not
need to understand whether the reference is aligned or unaligned.
<
> But, yeah, for me, a major selling point for unaligned access is mostly
> that I can copy blocks of memory around like:
>   v0=((uint64_t *)cs)[0];
>   v1=((uint64_t *)cs)[1];
>   v2=((uint64_t *)cs)[2];
>   v3=((uint64_t *)cs)[3];
>   ((uint64_t *)ct)[0]=v0;
>   ((uint64_t *)ct)[1]=v1;
>   ((uint64_t *)ct)[2]=v2;
>   ((uint64_t *)ct)[3]=v3;
>   cs+=32; ct+=32;
<
MM Rcs,Rct,#length // without the for loop
<
> For Huffman, some of the fastest strategies for implementing the
> bitstream reading/writing tend to casually make use of unaligned access
> (shifting in and loading bytes is slower in comparison).

> Though, all this falls on its face, if encountering a CPU that uses
> traps to emulate unaligned access (apparently a lot of the SiFive cores
> and similar).
<
Traps to perform unaligned are so 1985......either don't allow them at all
(SIGSEGV) or treat them as first class citizens. The former fails in the market.
<
>

MitchAlsup

unread,
Nov 10, 2023, 6:27:14 PM11/10/23
to
But it follows the S.E.L. 32/{...} series and several other minicomputers
with isolated base registers. In the 32/{...} series, there were 2 LDs and
2 STs:
1 LD was byte (signed) with 19-bit displacement
2 LD was size (signed) with the lower bits of displacement specifying size
3 ST was byte <ibid>
4 ST was size <ibid>
<
only registers 1-7 could be used as base register.
<
I saw several others using similar tricks but can't remember.....

BGB

unread,
Nov 10, 2023, 9:40:07 PM11/10/23
to
On 11/10/2023 5:21 PM, MitchAlsup wrote:
> BGB wrote:
>
>> On 11/10/2023 12:22 PM, MitchAlsup wrote:
>>
>>>> One can argue that aligned-only allows for a cheaper L1 D$, but also
>>>> "sucks pretty bad" for some tasks:
>>>>    Fast memcpy;
>>>>    LZ decompression;
>>>>    Huffman;
>>>>    ...
>>> <
>>> Time found that HW can solve the problem way more than adequately--
>>> obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
>>> <
>>>
>
>> Wait, are you arguing for aligned-only memory ops here?...
> <
>
> No, I am arguing that all memory references are inherently unaligned,
> but aligned references never suffer a stall penalty; and the compiler
> does not need to understand whether the reference is aligned or
> unaligned.
> <

OK, fair enough.

I don't have separate aligned/unaligned ops for anything QWORD or
smaller, as all these cases are implicitly unaligned.

Though, aligned is sometimes a little faster, due to playing better with
the L1 cache; but, using misaligned memory access is generally faster
than any of the traditional workarounds (the difference being mostly a
slight increase in the probability of triggering an L1 cache miss).


The main exception is MOV.X requiring 64-bit alignment (for a 128-bit
memory access), but the unaligned fallback here is to use a pair of
MOV.Q instructions instead.

But, this was in part because of how the L1 caches were implemented, and
supporting fully unaligned 128-bit access would have been more expensive
(and the relative gain is smaller).

This does mean alternate logic for aligned vs unaligned "memcpy()", with
the unaligned case being a little slower as a result of needing to use
MOV.Q ops.


It is possible a case could be made for allowing fully unaligned MOV.X
as well.

Would mostly involve reworking how MOV.X is implemented relative to the
extract/insert logic (likely internally working with 192 bits rather
than 128; as-is, MOV.X is implemented by bypassing the main
extract/insert logic).


>> But, yeah, for me, a major selling point for unaligned access is
>> mostly that I can copy blocks of memory around like:
>>    v0=((uint64_t *)cs)[0];
>>    v1=((uint64_t *)cs)[1];
>>    v2=((uint64_t *)cs)[2];
>>    v3=((uint64_t *)cs)[3];
>>    ((uint64_t *)ct)[0]=v0;
>>    ((uint64_t *)ct)[1]=v1;
>>    ((uint64_t *)ct)[2]=v2;
>>    ((uint64_t *)ct)[3]=v3;
>>    cs+=32; ct+=32;
> <
>     MM   Rcs,Rct,#length            // without the for loop
> <

I typically use a "while()" loop or similar, but yeah...

At present, the fastest loop strategy is generally:
while(n--)
{
...
}
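
For instance, the earlier block-copy code wrapped in that idiom might
look like (a sketch; the function and parameter names are mine):

  #include <stddef.h>
  #include <stdint.h>

  /* Copy n 32-byte blocks using (possibly unaligned) 64-bit ops. */
  void copy_blocks(uint64_t *ct, const uint64_t *cs, size_t n)
  {
      while(n--)
      {
          uint64_t v0 = cs[0], v1 = cs[1], v2 = cs[2], v3 = cs[3];
          ct[0] = v0; ct[1] = v1; ct[2] = v2; ct[3] = v3;
          cs += 4; ct += 4;   /* same as cs+=32, ct+=32 in bytes */
      }
  }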




>> For Huffman, some of the fastest strategies for implementing the
>> bitstream reading/writing tend to casually make use of unaligned
>> access (shifting in and loading bytes is slower in comparison).
>
>> Though, all this falls on its face, if encountering a CPU that uses
>> traps to emulate unaligned access (apparently a lot of the SiFive
>> cores and similar).
> <
> Traps to perform unaligned are so 1985......either don't allow them at all
> (SIGSEGV) or treat them as first class citizens. The former fails in the
> market.
> <
>>

Apparently SiFive went this way, for some reason...

Like, RISC-V requires unaligned access to work, but doesn't specify how,
and apparently they considered trapping to be an acceptable option, but
trapping sucks for performance.



Quadibloc

unread,
Nov 11, 2023, 12:40:04 AM11/11/23
to
On Thu, 09 Nov 2023 17:49:03 -0600, BGB-Alt wrote:

> Errm, splitting up registers like this is likely to hurt far more than
> anything that 16-bit displacements are likely to gain.

No doubt you're right.

As that means my 16-bit instructions, with the registers split into four
parts, are useless to compilers, now I have to go around in circles again.
I thought I had finally achieved a single instruction format that satisfied
my ambitions - and now I find it is fatally flawed.

One possibility is to go back to the full format for 32-bit memory
reference instructions. That will still leave me enough opcode space that a
four-bit prefix could precede three 20-bit short instructions. To avoid
creating a variable-length instruction set, which complicates decoding,
I would require such blocks to be aligned on 64-bit boundaries.

So now there's a nested block structure, of 64-bit blocks inside 256-bit
blocks!

John Savard

John Dallman

unread,
Nov 11, 2023, 1:50:50 AM11/11/23
to
In article <uijk93$2dc2i$2...@dont-email.me>, quad...@servername.invalid
(Quadibloc) wrote:

> This lends itself to writing code where four distinct threads are
> interleaved, helping pipelining in implementations too cheap to have
> out-of-order execution.

This is not the conventional way of implementing threads, and seems to
have some drawbacks:

One of the uses of threads is to scale to the hardware resources
available. With this approach, the number of threads is baked in at
compile time.

Debugging such interleaved threads is likely to be even more confusing
than debugging multiple threads usually is.

Pipeline stalls affect every thread, rather than just the thread that
triggers them.

The common threading APIs also lack a way to set such threads to work,
but that's a far more soluble problem.

John

John Dallman

unread,
Nov 11, 2023, 2:08:04 AM11/11/23
to
In article <uikbng$2lh5f$1...@dont-email.me>, quad...@servername.invalid
(Quadibloc) wrote:

> Well, I think that superscalar operation of microprocessors is a
> good thing.

Indeed.

> Explicitly indicating which instructions may execute in parallel
> is one way to facilitate that. Even if the Itanium was an
> unsuccessful implementation of that principle.

Intel tried that with the Pentium, with its two pipelines and run-time
automatic instruction scheduling, to moderate success. They tried it with
the i860, with compiler scheduling and a comprehensive lack of success.
The Itanium tried the i860 method much harder, and was still unsuccessful.


In engineering, the gap between "Doing this would be good" and "Here it
is working" generally involves having a good idea about /how/ to do it.

Finding an example where explicit but non-automatic parallelism worked
for general-purpose code and figuring out how that was done should be
easier than inventing a method. In the absence of that, we have some
evidence that just hoping the software people will solve this problem for
you doesn't work.

John

Anton Ertl

unread,
Nov 11, 2023, 2:49:46 AM11/11/23
to
BGB <cr8...@gmail.com> writes:
>On 11/10/2023 12:22 PM, MitchAlsup wrote:
>> BGB wrote:
>>> One can argue that aligned-only allows for a cheaper L1 D$, but also
>>> "sucks pretty bad" for some tasks:
>>>    Fast memcpy;
>>>    LZ decompression;
>>>    Huffman;
>>>    ...

Hashing

>Though, all this falls on its face, if encountering a CPU that uses
>traps to emulate unaligned access (apparently a lot of the SiFive cores
>and similar).

Let's see what this SiFive U74 does:

[fedora-starfive:~/nfstmp/gforth-riscv:98397] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye "

Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye ':

469832112 instructions:u # 0.79 insn per cycle
591015904 cycles:u

0.609751748 seconds time elapsed

0.533195000 seconds user
0.061522000 seconds sys


[fedora-starfive:~/nfstmp/gforth-riscv:98398] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye "

Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye ':

53533370273 instructions:u # 0.77 insn per cycle
69304924487 cycles:u

69.368484169 seconds time elapsed

69.256290000 seconds user
0.049997000 seconds sys

So when we do aligned accesses (first command), the code performs 4.7
instructions and 5.9 cycles per load, while for unaligned accesses
(second command) the same code performs 535.3 instructions and 693.0
cycles per load. So apparently an unaligned load triggers >500
additional instructions, confirming your claim. Interestingly, all
that is attributed to user time; maybe the fixup is performed by a
user-level trap or microcode.

Still, the approach of having separate instructions for aligned and
unaligned accesses (typically with several instructions for the
unaligned case) has been tried and discarded. Software just does not
declare that some access will be unaligned.

Particularly strong evidence for this is that gas generated
non-working code for ustq (unaligned store quadword) on Alpha for
several years, and apparently nobody noticed until I gave an exercise
to my students where they should use ustq (so no production use,
either).

So, every general-purpose architecture, including RISC-V, the
spiritual descendant of MIPS and Alpha (which had the division),
settled on having memory access instructions that perform both aligned
and unaligned accesses (with performance advantages for aligned
accesses).

If RISC-V implementations want to perform well for code that uses
unaligned accesses for memory copying, compression/decompression, or
hashing, they will eventually have to implement unaligned accesses
more efficiently, but at least the code works, and aligned accesses
are fast.

Why would you not go the same way? It would also save on instruction
encoding space.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

John Dallman

unread,
Nov 11, 2023, 3:37:51 AM11/11/23
to
In article <2023Nov1...@mips.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Let's see what this SiFive U74 does:
...

> So apparently an unaligned load triggers >500 additional instructions,
> confirming your claim.

Wow. I think I'd rather have SIGBUS on unaligned accesses. That is at
least obvious. Slowdowns like this will be a major drag on performance,
simply because finding them all is tricky.

John

BGB

unread,
Nov 11, 2023, 4:05:46 AM11/11/23
to
On 11/11/2023 1:22 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> On 11/10/2023 12:22 PM, MitchAlsup wrote:
>>> BGB wrote:
>>>> One can argue that aligned-only allows for a cheaper L1 D$, but also
>>>> "sucks pretty bad" for some tasks:
>>>>    Fast memcpy;
>>>>    LZ decompression;
>>>>    Huffman;
>>>>    ...
>
> Hashing
>

Possibly true.


Some of my data hash/checksum functions were along the lines of:
  uint32_t *cs, *cse;
  uint64_t v0, v1, v;

  cs=buf; cse=buf+((sz+3)>>2);
  v0=1; v1=1;
  while(cs<cse)
  {
    v=*cs++;
    v0+=v;
    v1+=v0;
  }
  v0=((uint32_t)v0)+(v0>>32); //*
  v1=((uint32_t)v1)+(v1>>32);
  v0=((uint32_t)v0)+(v0>>32);
  v1=((uint32_t)v1)+(v1>>32);
  v=(uint32_t)(v0^v1);

*: This step may seem frivolous, but seems to increase the strength of
the checksum.

There are faster variants, but this one gives the general idea.
I am not aware of anyone else doing it this way, but it is faster than
either Adler32 or CRC32, while giving some similar properties (the second
sum detecting various issues which would be missed with a single sum).
I wasn't that sure how Adler32 was implemented, but it is "kinda weak" in
any case.

A faster variant of this is to run multiple sets of sums in parallel and
then combine the values at the end (see the sketch below).
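
Something like this, say (my own illustration; how the lanes get
combined at the end is an arbitrary choice here, not necessarily what
the production version does):

  #include <stddef.h>
  #include <stdint.h>

  uint32_t cksum_2lane(const uint32_t *buf, size_t nw)
  {
      uint64_t a0 = 1, a1 = 1, b0 = 1, b1 = 1;
      size_t i = 0;
      for(; (i + 1) < nw; i += 2)
      {
          a0 += buf[i];     a1 += a0;   /* lane A: even words */
          b0 += buf[i + 1]; b1 += b0;   /* lane B: odd words  */
      }
      if(i < nw) { a0 += buf[i]; a1 += a0; }  /* odd leftover word */
      uint64_t v0 = a0 + b0, v1 = a1 + b1;    /* combine the lanes */
      v0 = ((uint32_t)v0) + (v0 >> 32); v1 = ((uint32_t)v1) + (v1 >> 32);
      v0 = ((uint32_t)v0) + (v0 >> 32); v1 = ((uint32_t)v1) + (v1 >> 32);
      return (uint32_t)(v0 ^ v1);
  }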

On the BJX2 core, the performance impact of using misaligned load and
store is approximately 3% in my tests, I suspect mostly due to a
slightly higher incidence of L1 cache misses.


> Still, the approach of having separate instructions for aligned and
> unaligned accesses (typically with several instructionf for the
> unaligned case) has been tried and discarded. Software just does not
> declare that some access will be unaligned.
>
> Particularly strong evidence for this is that gas generated
> non-working code for ustq (unaligned store quadword) on Alpha for
> several years, and apparently nobody noticed until I gave an exercise
> to my students where they should use ustq (so no production use,
> either).
>
> So, every general-purpose architecture, including RISC-V, the
> spiritual descendant of MIPS and Alpha (which had the division),
> settled on having memory access instructions that perform both aligned
> and unaligned accesses (with performance advantages for aligned
> accesses).
>
> If RISC-V implementations want to perform well for code that uses
> unaligned accesses for memory copying, compression/decompression, or
> hashing, they will eventually have to implement unaligned accesses
> more efficiently, but at least the code works, and aligned accesses
> are fast.
>
> Why would you not go the same way? It would also save on instruction
> encoding space.
>

I was never claiming that one should have separate instructions (since,
if the L1 cache supports unaligned access, what is the point of having
aligned-only variants of the instructions?...).


Rather, that it might make sense to do an aligned-only core, and then
trap on misaligned (possibly allowing the access to be emulated, as with
the SiFive cores); mostly in the name of making the L1 cache cheaper.


A few of my small core experiments had used aligned-only L1 caches, but
I mostly went with a natively unaligned designs for my bigger ISA
designs, mostly as I tend to make frequent use of unaligned memory
access as a "performance trick".



However, BJX2 has a natively unaligned L1 cache (well, apart from MOV.X).

Have gone and added the logic to allow MOV.X to be unaligned as well,
which mostly has the effect of a minor increase in LUT cost and similar
(mostly as the internal extract/insert logic needed to be widened from
128 to 192 bits to deal with this; with MOV.X now being handled in a
similar way to MOV.Q when this feature is enabled).


Though, one open question is whether to "formally fix" the
Op96-at-((PC&0xE)==0xE) issue. Ironically, in this case, the "fix" is
already present in the Verilog code; the restriction exists more as a
"break glass to save some LUTs" option.


Well, along with some other wonk, like leaving it undefined what happens
if the instruction stream is allowed to cross a 4GB boundary, ...
Branching is fine; it is just that the PC-increment logic can save some
latency by not bothering with the high 16 bits.

I guess, in an ideal world, there wouldn't be a lot of this wonk, but
needing to often battle with timing constraints and similar does create
incentive for corner cutting in various areas.


> - anton

Anton Ertl

unread,
Nov 11, 2023, 6:03:11 AM11/11/23
to
j...@cix.co.uk (John Dallman) writes:
>In article <2023Nov1...@mips.complang.tuwien.ac.at>,
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>> So apparently an unaligned load triggers >500 additional instructions,
>> confirming your claim.
>
>Wow. I think I'd rather have SIGBUS on unaligned accesses. That is at
>least obvious.

True, but that has been tried out and, in a world (like Linux) where
software is developed on a platform that supports unaligned accesses,
and then compiled by package maintainers (who often are not that
familiar with the software) on a lot of platforms, the end result was
that the kernel by default performed a fixup (and put a message in the
dmesg buffer) instead of delivering a SIGBUS.

There was a system call for switching to the SIGBUS behaviour. On
Tru64 OSF/1 (or whatever it is called this week), the default
behaviour was to SIGBUS, but it had the same system call, and a
shell-level tool "uac" to change the behaviour to fix it up. I
implemented a tool "uace" for Linux that can be used for running a
process with the SIGBUS behaviour that you desire:
<https://www.complang.tuwien.ac.at/anton/uace.c>. Maybe something
similar is possible on the U74.
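
On Linux the switch is a prctl() call, something like the following
sketch (assuming the architecture actually wires up PR_SET_UNALIGN,
which only some do, e.g. alpha, ia64, parisc):

  #include <stdio.h>
  #include <sys/prctl.h>

  /* Deliver SIGBUS instead of silently fixing up unaligned accesses. */
  int main(void)
  {
      if (prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS, 0, 0, 0) != 0)
          perror("PR_SET_UNALIGN");
      return 0;
  }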

Anyway, it seems that the problem was not a big one on Linux-Alpha
(messages about unaligned accesses were not that frequent).
Apparently the large majority of code performs aligned accesses. It's
just that there are a few unaligned ones.

I would not worry about cores like the U74 (and I have a program that
uses unaligned accesses for hashing); that's just a stepping stone for
getting more capable RISC-V cores, and at some point (before RISC-V
becomes mainstream) the trapping will be replaced with something more
efficient.

We have seen the same development on AMD64. The Penryn
(second-generation Core 2) takes 159 cycles for an unaligned load that
crosses a page boundary, the Sandy Bridge takes 28
<http://al.howardknight.net/?ID=143135464800>. The Sandy Bridge and
Ivy Bridge take 200 cycles for an unaligned page-crossing store,
Haswell and Skylake take 25 and 24.

Anton Ertl

unread,
Nov 11, 2023, 6:44:20 AM11/11/23
to
BGB <cr8...@gmail.com> writes:
>On 11/11/2023 1:22 AM, Anton Ertl wrote:
>> Hashing
>>
>
>Possibly true.

Definitely true: The data you want to hash may be aligned to byte
boundaries (e.g., strings), but a fast hash function loads it at the
largest granularity possible and also processes the loaded values at
the largest granularity possible.

And in contrast to block copying, where you can do some prelude, then
perform aligned accesses, and then a postlude (at least on one side of
the copying), for this kind of hashing you want the first n bytes in a
register right from the first step, because the first byte influences the
hash function result differently than the second byte.

What you could do is load aligned into a shift buffer (in a register),
and then use something like AMD64's shld to get the data in the needed
form. Same for the second side of block copying. But is this faster
on modern CPUs?
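
In C the idea looks something like this (a little-endian sketch; note
that it can touch up to 7 bytes past the addressed data, so the usual
over-read caveats apply):

  #include <stdint.h>

  /* Synthesize an unaligned 64-bit load from two aligned loads. */
  uint64_t uload64(const void *p)
  {
      uintptr_t a = (uintptr_t)p;
      const uint64_t *w = (const uint64_t *)(a & ~(uintptr_t)7);
      unsigned sh = (unsigned)(a & 7) * 8;
      if (sh == 0)
          return w[0];
      return (w[0] >> sh) | (w[1] << (64 - sh));  /* shld-style merge */
  }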

John Dallman

unread,
Nov 11, 2023, 11:53:32 AM11/11/23
to
In article <2023Nov1...@mips.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> True, but that has been tried out and, in a world (like Linux) where
> software is developed on a platform that supports unaligned
> accesses, and then compiled by package maintainers (who often are
> not that familiar with the software) on a lot of platforms, the end
> result was that the kernel by default performed a fixup (and put a
> message in the dmesg buffer) instead of delivering a SIGBUS.

Yup. The software I work on is meant, in itself, to work on platforms
that enforce alignment, and it was a useful catcher for some kinds of bug.
However, I'm now down to one that actually enforces it, in SPARC Solaris,
and that isn't long for this world.

I dug into what it would take to have x86-64 Linux work with alignment
enforcement turned on, and it's a huge job.

John

MitchAlsup

unread,
Nov 11, 2023, 1:12:10 PM11/11/23
to
Stephen Fuld wrote:

> On 11/10/2023 10:24 AM, BGB wrote:
>
>>
>>
>> Much better to have a big flat register space.

> Yes, but sometimes you just need "another bit" in the instructions. So
> an alternative is to break the requirement that all register specifier
> fields in the instruction be the same length. So, for example, allow
<
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom-needed bits.

Chris M. Thomasson

unread,
Nov 11, 2023, 2:30:24 PM11/11/23
to
On 11/10/2023 11:22 PM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> On 11/10/2023 12:22 PM, MitchAlsup wrote:
>>> BGB wrote:
>>>> One can argue that aligned-only allows for a cheaper L1 D$, but also
>>>> "sucks pretty bad" for some tasks:
>>>>    Fast memcpy;
>>>>    LZ decompression;
>>>>    Huffman;
>>>>    ...
>
> Hashing
[...]

Fwiw, proper alignment is very important for a programmer to gain some
of the benefits of virtually "any" target architecture. For instance,
_proper_ cache-line alignment, say, L2, and its 64 bytes. So, the
programmer can set up an array that is aligned on a cache-line boundary
and pad each element of said array up to the size of an L2 cache line.

Two steps... Align your memory on a proper cache-line boundary, and pad
the size of each element up to the size of a single cache line (see the
sketch below).

Think of LL/SC... If one did not honor the reservation granule....
well... Shit.. False sharing on a reservation granule can cause livelock
and damage forward progress wrt some LL/SC setups.
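
A minimal C11 sketch of those two steps, assuming a 64-byte line (the
names are illustrative):

  #include <stdint.h>

  #define CACHE_LINE 64

  /* Step 2: _Alignas pads sizeof(padded_slot) out to one full line. */
  typedef struct {
      _Alignas(CACHE_LINE) uint64_t value;
  } padded_slot;

  /* Step 1: the array (and thus each slot) starts on a line boundary. */
  static padded_slot slots[16];

  _Static_assert(sizeof(padded_slot) == CACHE_LINE, "one slot per line");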

BGB-Alt

unread,
Nov 11, 2023, 3:33:25 PM11/11/23
to
On 11/11/2023 12:11 PM, MitchAlsup wrote:
> Stephen Fuld wrote:
>
>> On 11/10/2023 10:24 AM, BGB wrote:
>>
>>>
>>>
>>> Much better to have a big flat register space.
>
>> Yes, but sometimes you just need "another bit" in the instructions.
>> So an alternative is to break the requirement that all register
>> specifier fields in the instruction be the same length.  So, for
>> example, allow
> <
> Another way to get a few more bits is to use a prefix-instruction like
> CARRY for those seldom needed bits.
> <

Or, a similar role is served by my Jumbo-Op64 prefix.

So, there are two different Jumbo prefixes:
Jumbo-Imm, which mostly just makes the immed/disp field bigger;
Jumbo-Op64, which mostly extends the opcode and other things
(it may extend the immediate, but less so; that is not its main purpose).

Op64 also, optionally:
Was the original mechanism to address R32..R63, before the XGPR and XG2
encodings were added, and is still needed (in Baseline) for the parts of
the ISA not covered by the XGPR encodings;
Adds a potential 4th register, extra displacement (or a smaller immediate
extension), or rounding-mode / opcode bits (depending on the base
instruction).

As-is, 8 bits in the Op64 prefix are Must Be Zero; they are designated
specifically for expanding the opcode space (with the 00 case mapping to
the same instruction as in the basic 32-bit encoding).

Thomas Koenig

unread,
Nov 11, 2023, 4:22:05 PM11/11/23
to
Chris M. Thomasson <chris.m.t...@gmail.com> schrieb:
> On 11/10/2023 11:22 PM, Anton Ertl wrote:
>> BGB <cr8...@gmail.com> writes:
>>> On 11/10/2023 12:22 PM, MitchAlsup wrote:
>>>> BGB wrote:
>>>>> One can argue that aligned-only allows for a cheaper L1 D$, but also
>>>>> "sucks pretty bad" for some tasks:
>>>>>    Fast memcpy;
>>>>>    LZ decompression;
>>>>>    Huffman;
>>>>>    ...
>>
>> Hashing
> [...]
>
> Fwiw, proper alignment is very important for a programmer to gain some
> of the benefits of virtually "any" target architecture. For instance,
> _proper_ cache-line alignment, say, L2, and its 64 bytes. So, the
> programmer can set up an array that is aligned on a cache-line boundary
> and pad each element of said array up to the size of an L2 cache line.
>
> Two steps... Align your memory on a proper cache-line boundary, and pad
> the size of each element up to the size of a single cache line.

For elements smaller than a cache line, that makes little sense as
written. I think there is an unwritten assumption "for elements larger
than a cache line" there, or we would all be using 64-byte bools.

Scott Lurndal

unread,
Nov 11, 2023, 4:28:11 PM11/11/23
to