
The Impending Return of Concertina III


Quadibloc
Jan 22, 2024, 11:07:54 PM
As I have noted, the original Concertina architecture was not a
serious proposal for a computer architecture, but merely a
description of an architecture intended to illustrate how
computers work.

Concertina II was a step above that; somewhat serious, but
not fully so; still too idiosyncratic to be taken seriously
as an alternative.

But in a discussion of Concertina II - or, rather, in a thread
that started with Concertina II, but went on to discussing
other things - it was noted that RISC-V is badly flawed.

In that case, an alternative is needed. I need to go beyond
Concertina II - with which I am satisfied now as meeting its
goals, finally - to something that could be considered genuinely
serious.

At the moment, only a link to Concertina III is present on my
main page; no content is there yet.

John Savard

Quadibloc
Jan 23, 2024, 1:41:24 AM
On Tue, 23 Jan 2024 04:07:50 +0000, I wrote:

> At the moment, only a link to Concertina III is present on my
> main page, no content is yet present.

The first few pages, with diagrams of this ultimate simplification
of Concertina II, are now present, starting at

http://www.quadibloc.com/arch/ct19int.htm

I've gone to 15-bit displacements, in order to avoid compromising
addressing modes, while allowing 16-bit instructions without
switching to an alternate instruction set.

Possibly using only three base registers is also sufficiently
non-violent to the addressing modes that I should have done that
instead, so I will likely give consideration to that option in
the days ahead.

Unfortunately, since I have become convinced that pseudo-immediate
values are a necessity, I could not get rid of block structure,
which, as noted, is of course the major impediment to this ISA
being considered for widespread adoption.

John Savard

Quadibloc
Jan 23, 2024, 4:50:34 AM
On Tue, 23 Jan 2024 06:41:20 +0000, I wrote:

> I've gone to 15-bit displacements, in order to avoid compromising
> addressing modes, while allowing 16-bit instructions without
> switching to an alternate instruction set.
>
> Possibly using only three base registers is also sufficiently
> non-violent to the addressing modes that I should have done that
> instead, so I will likely give consideration to that option in
> the days ahead.

I have indeed decided that using three base registers for the
basic load-store instructions is much preferable to shortening the
length of the displacement even by one bit.

John Savard

Robert Finch
Jan 23, 2024, 7:06:54 AM
Packing and unpacking DFP numbers does not take a lot of logic, assuming
one of the common DPD packing methods. The number of registers handling
DFP values could be doubled if they were unpacked and packed for each
operation. Since DFP arithmetic has high latency anyway (in Q+, for
example, the DFP unit unpacks, performs the operation, then repacks the
number), registers need only be 128 bits wide.

256 bits seems a little narrow for a vector register. I have seen
several other architectures with vector registers holding sixteen or
more 32-bit values, that is, a length of 512 bits. This is also the
width of a typical cache line.

Having the base register implicitly encoded in the instruction is a way
to reduce the number of bits used to represent it. There seem to be a
lot of different base-register usages, though. Won't that make the
compiler more difficult to write?

Does the array addressing mode have memory-indirect addressing? It seems
like a complex mode to support.

Block headers are tricky to use. They need to follow the output of the
instructions in the assembler so that the assembler has time to generate
the appropriate bits for the header. The entire instruction block needs
to be flushed at the end of a function.

Quadibloc
Jan 23, 2024, 8:07:54 AM
On Tue, 23 Jan 2024 07:06:47 -0500, Robert Finch wrote:

> Packing and unpacking DFP numbers does not take a lot of logic, assuming
> one of the common DPD packing methods.

Well, I'm thinking of the method used by IBM. It is true that that
method was designed to use a minimal amount of logic.

> The number of registers handling
> DFP values could be doubled if they were unpacked and packed for each
> operation.

Not doubled, only increased from 24 to 32.

> Since DFP arithmetic has a high latency anyway, for example
> Q+ the DFP unit unpacks, performs the operation, then repacks the DFP
> number. So, registers only need be 128-bit.

I don't believe in wasting any time. And the latency of DFP operations
can be reduced; it is possible to design a Wallace Tree multiplier for
BCD arithmetic.

> 256 bits seems a little narrow for a vector register.

The original Concertina architecture, which had short vector registers
of that size, was designed before AVX-512 was invented. Rather than attempting
to keep revising the size of the short vector registers to keep up, the
ISA also includes long vector registers.

These are patterned after the vector registers of the Cray I, and have room
for 64 double-precision floating-point numbers each.

> I have seen
> several other architectures with vector registers supporting 16+ 32-bit
> values, or a length of 512-bits. This is also the width of a typical
> cache line.

> Having the base register implicitly encoded in the instruction is a way
> to reduce the number of bits used to represent the base register.

Instead of base registers, then, there would be a code segment register
and a data segment register, like on x86. But then how do I access data
belonging to another subroutine? Without variable length instructions,
segment prefixes like on x86 aren't an option. (There actually are
instruction prefixes in the ISA, but they're not intended to be
_common_!)

> There
> seems to be a lot of different base register usages. Will not that make
> the compiler more difficult to write?

I suppose it could. The idea is basically that a program would pick
one memory model and stick with it - a normal program would use the
base registers connected with 16-bit displacements for everything...
except that, where different routines share access to a small area of
memory, then that pointer can be put in a base register for 12-bit
displacements.

> Does array addressing mode have memory indirect addressing? It seems
> like a complex mode to support.

It does indeed use indirect addressing. The idea is that if your
program has a large number of arrays which are over 64K in size,
it shouldn't be necessary to either consume a base register for
each array, or freshly load a base register with the array address
every time it's referenced.

Using the mode is simple enough; basically, the address in the
instruction is effectively the name of the array instead of its
address, and the array is indexed normally.

Of course, there's the overhead of indirection on every access.
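As a hedged sketch (the names and flat memory model here are my own, not Concertina's actual semantics), the mode amounts to one extra pointer fetch per access:

```python
# Hedged model of the memory-indirect array mode: the instruction's
# address field selects a cell that holds the array's base address; the
# hardware loads that pointer, then indexes normally, paying one extra
# memory fetch per access.
def array_indirect_load(mem: dict, pointer_cell: int, index: int, elem_size: int):
    base = mem[pointer_cell]              # the indirection: fetch the array's address
    return mem[base + index * elem_size]  # then index as usual
```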

So in Concertina II, I had added a new addressing mode which
simply uses the same feature that allows immediate values to
tack a 64-bit absolute address on to an instruction. (Since it
looks like a 64-bit number, the linking loader can relocate it.)
That fancy feature, though, was too much complication for this
stripped-down ISA.

> Block headers are tricky to use. They need to follow the output of the
> instructions in the assembler so that the assembler has time to generate
> the appropriate bits for the header. The entire instruction block needs
> to be flushed at the end of a function.

I don't see an alternative, though, to block structure to allow instructions
to have, in the instruction stream, immediate values of any length, and yet
allow instructions to be rapidly decoded in parallel as if they were all
32 bits long.

And block structure also allows instruction parallelism to be explicitly
indicated.
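The parallel-decode argument can be made concrete with a hypothetical sketch (this is not Concertina's actual header format): a block header carries a bitmask saying which 32-bit slots hold pseudo-immediate constants rather than instructions, so every slot can be classified independently instead of scanning the stream sequentially for variable-length items.

```python
# Hypothetical block-header decode: the mask is known up front, so each
# 32-bit slot can be classified in parallel as instruction vs. constant.
def classify_block(header_mask: int, words: list):
    insns, consts = [], []
    for i, w in enumerate(words):
        if (header_mask >> i) & 1:      # bit set: this slot is a constant
            consts.append((i, w))
        else:                           # bit clear: a normal 32-bit instruction
            insns.append((i, w))
    return insns, consts
```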

If you decide not to use the block header feature, though, what you have
left is still a perfectly good ISA. So people can support the architecture
with a basic compiler which doesn't make full use of the chip's features,
and then a fancier compiler which produces more optimal code can make the
effort to handle the block headers.

John Savard

Quadibloc
Jan 23, 2024, 9:01:11 AM
On Tue, 23 Jan 2024 13:07:49 +0000, Quadibloc wrote:

> So in Concertina II, I had added a new addressing mode which
> simply uses the same feature that allows immediate values to
> tack a 64-bit absolute address on to an instruction. (Since it
> looks like a 64-bit number, the linking loader can relocate it.)
> That fancy feature, though, was too much complication for this
> stripped-down ISA.

This discussion has convinced me that this addressing mode,
although relegated to an alternate instruction set in Concertina II,
is important enough for maximizing performance that it does need
to be included in Concertina III, and the appropriate changes
have been made.

John Savard

BGB
Jan 23, 2024, 2:56:39 PM
On 1/23/2024 6:06 AM, Robert Finch wrote:
> On 2024-01-23 4:50 a.m., Quadibloc wrote:
>> On Tue, 23 Jan 2024 06:41:20 +0000, I wrote:
>>
>>> I've gone to 15-bit displacements, in order to avoid compromising
>>> addressing modes, while allowing 16-bit instructions without
>>> switching to an alternate instruction set.
>>>
>>> Possibly using only three base registers is also sufficiently
>>> non-violent to the addressing modes that I should have done that
>>> instead, so I will likely give consideration to that option in
>>> the days ahead.
>>
>> I have indeed decided that using three base registers for the
>> basic load-store instructions is much preferable to shortening the
>> length of the displacement even by one bit.
>>
>> John Savard
>
> Packing and unpacking DFP numbers does not take a lot of logic, assuming
> one of the common DPD packing methods. The number of registers handling
> DFP values could be doubled if they were unpacked and packed for each
> operation. Since DFP arithmetic has a high latency anyway, for example
> Q+ the DFP unit unpacks, performs the operation, then repacks the DFP
> number. So, registers only need be 128-bit.
>

In my case, had experimented with BCD instructions and DPD pack/unpack.
They were operating with 16 digits packed BCD (64-bits) or 15-digits in
50 bits (DPD). The ops could daisy-chain to support 32-digit or 48-digit
calculations.
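The plain-BCD half of this is easy to sketch (hedged: sixteen digits at four bits apiece in a 64-bit word; the DPD variant, 15 digits in 50 bits, needs the declet coding tables and is omitted here):

```python
# Hedged sketch of 16-digit packed BCD in a 64-bit word.
def bcd_pack(digits):                   # most-significant digit first
    w = 0
    for d in digits:
        assert 0 <= d <= 9              # each nybble holds one decimal digit
        w = (w << 4) | d
    return w

def bcd_unpack(w, ndigits=16):
    return [(w >> (4 * i)) & 0xF for i in reversed(range(ndigits))]
```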

Had the issue that there was a lack of a compelling use case (to justify
the added cost).

Even with this, a format like Decimal128 is still going to be slower
than Binary128, and the BCD ops had few other obvious compelling
use-cases.

In practice, the main feature that these instructions added was the
realization that they could be used for faster Binary<->Decimal
conversion. But, even then, "potentially shaves some clock cycles off
the printf's" isn't all that compelling.


And, if one wants decimal floating point, something more akin to the
format that .NET had used makes more sense (using 32-bit chunks to
represent linear values in the range of 000000000..999999999).
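A hedged sketch of that representation (my own code, just illustrating the idea): the significand is held as 32-bit limbs, each a linear value in 0..999_999_999, which works because 10**9 < 2**32.

```python
# Base-10**9 limbs: conversion to and from a plain integer is just
# positional arithmetic in base one billion.
BASE = 10**9

def to_limbs(n: int) -> list:
    limbs = []
    while True:
        n, group = divmod(n, BASE)      # peel off nine decimal digits
        limbs.append(group)             # least-significant limb first
        if n == 0:
            return limbs

def from_limbs(limbs) -> int:
    return sum(g * BASE**i for i, g in enumerate(limbs))
```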


> 256 bits seems a little narrow for a vector register. I have seen
> several other architectures with vector registers supporting 16+ 32-bit
> values, or a length of 512-bits. This is also the width of a typical
> cache line.
>

My case:
Narrow SIMD, 64-bit or 2x 64-bit.

Most data fits nicely in 2- or 4-element vectors, but it is harder to
make effective use of wider vectors unless one is effectively SIMD'ing
the SIMD operations.

Though, I guess some amount of vector stuff seems to try to present an
abstraction of looping over arrays, rather than say: "Here is a 3D
vector, calculate a dot or cross product, ..."


> Having the base register implicitly encoded in the instruction is a way
> to reduce the number of bits used to represent the base register. There
> seems to be a lot of different base register usages. Will not that make
> the compiler more difficult to write?
>

Yes. Short of 16-bit ops or similar, personally I would advise against
this sort of thing.

Better to have instructions that can access all of the registers at the
same time.


> Does array addressing mode have memory indirect addressing? It seems
> like a complex mode to support.
>

IME, the main address modes are:
(Rm, Disp) // ~ 66% +/- 10%
(Rm, Ro*FixSc) // ~ 33% +/- 10%
Where: FixSc matches the element size.
Pretty much everything else falls into the noise.
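The two dominant modes can be illustrated as effective-address formulas (a Python sketch with hypothetical register values, not hardware):

```python
# (Rm, Disp): register plus displacement, as in struct-field or
# stack-slot access; (Rm, Ro*FixSc): register plus scaled index, as in
# array element access, with FixSc fixed to the element size.
def ea_disp(rm: int, disp: int) -> int:
    return rm + disp                    # (Rm, Disp)

def ea_index(rm: int, ro: int, fixsc: int) -> int:
    return rm + ro * fixsc              # (Rm, Ro*FixSc)
```

In C terms, `p->field` maps to the first form and `a[i]` to the second.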

RISC-V only has the former, and kinda shoots itself in the foot:
GCC is good at eliminating most SP-relative loads/stores;
that means the nominal percentage of indexed accesses is even higher...

As a result, the code is basically left doing excessive amounts of
shifts and adds, which (vs BJX2) effectively dethrone the memory
load/store ops for top-place.


Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
also shoots itself in the foot. Because, not only has one hit the limits
of the ALU and LD/ST ops, there are no cheap fallbacks for intermediate
range constants.


If my compiler, with its arguably poor optimizer and barely functional
register allocation, is beating GCC for performance (when targeting
RISC-V), I don't really consider this a win for some of RISC-V's design
choices.

And, if GCC, in its great wisdom, is mostly loading constants from
memory (having apparently offloaded most of them into the ".data"
section), this is also not a good sign.

Also, needing to use shift-pairs to sign- and zero-extend things is a
bit weak as well, ...

Though, theoretically, one can at least sign-extend 'int' with:
ADDIW Xd, Xs, 0


Another minor annoyance:
Bxx Rs, Rt, Disp //Compare two registers and branch
Is needlessly expensive (both for encoding space and logic cost).
Much of the time, Rs or Rt are X0 anyways.
Meanwhile:
Bxx Rs, Disp //Compare with zero and branch
Has much of the benefit, but at a lower cost.

So, say, "if(x>10)" goes from, say:
LI X8, 10
BGT X10, X8, label
To, say:
SLTI X8, X10, 11
BEQ X8, label
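A quick sanity check of that rewrite (a sketch; the BEQ here stands for the proposed compare-with-zero branch): SLTI X8, X10, 11 sets X8 to 1 when x < 11, i.e. x <= 10, so branching when X8 is zero is exactly the "x > 10" case.

```python
# Model the two-op sequence and confirm it tests the same predicate.
def slti_sequence_taken(x: int) -> bool:
    x8 = 1 if x < 11 else 0             # SLTI X8, X10, 11
    return x8 == 0                      # branch taken when the flag is clear

for x in range(-20, 40):
    assert slti_sequence_taken(x) == (x > 10)
```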


Also, as a random annoyance, RISC-V's instruction layout is very
difficult to decipher from a hexadecimal view. One basically needs to
dump it in binary to make it viable to mentally parse and lookup
instructions, which sucks.

I will count this one in BJX2's favor, in that it isn't quite such a
horrid level of suck to mentally decode instructions presented in
hexadecimal form.


Granted, BJX2 has some design flaws as well.

But, as noted, in a "head to head" comparison BJX2 is seemingly holding
up fairly OK (despite my compiler's level of suck).

But, this is part of why I had kept putting RISC-V support on the back
shelf so much. Like, yes, it is a more popular ISA, and wasn't too hard
to support with my pipeline, but... It just kinda sucks as well...

Like, it isn't due to issues of lacking fancy features, so much as all
the areas where it "shoots itself in the foot" with more basic features.


> Block headers are tricky to use. They need to follow the output of the
> instructions in the assembler so that the assembler has time to generate
> the appropriate bits for the header. The entire instruction block needs
> to be flushed at the end of a function.
>

Agreed. Would not be in favor of block-headers or block structuring.
Linear instruction formats are preferable, preferably in 32-bit chunks.


Quadibloc
Jan 23, 2024, 4:00:06 PM
On Tue, 23 Jan 2024 13:56:32 -0600, BGB wrote:

> Agreed. Would not be in favor of block-headers or block structuring.
> Linear instruction formats are preferable, preferably in 32-bit chunks.

The good news is that, although Concertina III still has block structure,
it gives you a choice. The ISA is similar to a RISC architecture, but
with a number of added features, if you just use 32-bit instructions.

On Concertina II, you need to use block structure for:

- 17-bit instructions
- Immediate constants other than 8-bit or 16-bit
- Absolute array addresses
- Instruction prefixes
- Explicit indication of parallelism
- Instruction predication

On Concertina III, you need to use block structure for immediate
constants other than 8-bit, but the 16-bit instructions and the absolute
array addresses are available without block structure.

As it stands, Concertina III doesn't have instruction predication at all, which
is a deficiency I will need to see if I can remedy.

John Savard

Quadibloc
Jan 23, 2024, 4:27:42 PM
On Tue, 23 Jan 2024 21:00:01 +0000, Quadibloc wrote:

> the absolute array addresses are
> available without block structure.

No; they may not be in an alternate instruction set, but
they still are like pseudo-immediates, so they do need
the block structure.

John Savard

MitchAlsup1
Jan 23, 2024, 5:10:47 PM
BGB wrote:

> On 1/23/2024 6:06 AM, Robert Finch wrote:
>>

> IME, the main address modes are:
> (Rm, Disp) // ~ 66% +/- 10%
> (Rm, Ro*FixSc) // ~ 33% +/- 10%
> Where: FixSc matches the element size.
> Pretty much everything else falls into the noise.

With dynamically linked libraries one needs:: k is constant at link time

LD Rd,[IP,GOT[k]] // get a pointer to the external variable
and
CALX [IP,GOT[k]] // call external entry point

But now that you have the above you can easily get::

CALX [IP,Ri<<3,Table] // call indexed method
// can also be used for threaded JITs
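A hedged model of that pattern (Python standing in for hardware/loader behavior): the GOT is a table of pointers the loader patches at load time, and an external call loads slot k and jumps through it, so the calling code never embeds the final address.

```python
# Minimal GOT model: resolve once at dynamic-link time, then every
# external call pays one indirection through the table.
got = {}                                # slot index -> resolved target

def loader_resolve(k, target):
    got[k] = target                     # done once, by the loader

def calx(k, *args):
    return got[k](*args)                # CALX [IP,GOT[k]]: indirect call
```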

> RISC-V only has the former, but kinda shoots itself in the foot:
> GCC is good at eliminating most SP relative loads/stores;
> That means, the nominal percentage of indexed is even higher...

A funny thing happens when you get rid of the "extra instructions"
most RISC ISAs cause you to have in your instruction stream::
a) the number of instructions goes down
b) you get rid of the easy instructions
c) leaving all the complicated ones remaining

> As a result, the code is basically left doing excessive amounts of
> shifts and adds, which (vs BJX2) effectively dethrone the memory
> load/store ops for top-place.

These are the easy instructions that are not necessary when ISA is
properly conceived.

> Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
> also shoots itself in the foot. Because, not only has one hit the limits
> of the ALU and LD/ST ops, there are no cheap fallbacks for intermediate
> range constants.

My 66000 has constants of all sizes for all instructions.

> If my compiler, with its arguably poor optimizer and barely functional
> register allocation, is beating GCC for performance (when targeting
> RISC-V), I don't really consider this a win for some of RISC-V's design
> choices.

When you benchmark against a strawman, cows get to eat.

> And, if GCC in its great wisdom, is mostly loading constants from memory
> (having apparently offloaded most of them into the ".data" section),
> this is also not a good sign.

Loading constants:
a) pollutes the data cache
b) wastes energy
c) wastes instructions

> Also, needing to use shift-pairs to sign and zero extend things is a bit
> weak as well, ...

See cows eat above.

>

> Also, as a random annoyance, RISC-V's instruction layout is very
> difficult to decipher from a hexadecimal view. One basically needs to
> dump it in binary to make it viable to mentally parse and lookup
> instructions, which sucks.

When you consume 3/4ths of the instruction space for 16-bit instructions,
you create stress in other areas of the ISA.

Brian G. Lucas
Jan 23, 2024, 7:11:22 PM
On 1/23/24 16:10, MitchAlsup1 wrote:
>
> When you benchmark against a strawman, cows get to eat.

Not a farm boy I'll bet. Cows eat hay, but not straw.

brian

Chris M. Thomasson
Jan 23, 2024, 7:46:05 PM

BGB
Jan 24, 2024, 12:59:04 AM
On 1/23/2024 4:10 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/23/2024 6:06 AM, Robert Finch wrote:
>>>
>
>> IME, the main address modes are:
>>    (Rm, Disp)       // ~ 66%  +/- 10%
>>    (Rm, Ro*FixSc)   // ~ 33%  +/- 10%
>>      Where: FixSc matches the element size.
>> Pretty much everything else falls into the noise.
>
> With dynamically linked libraries one needs:: k is constant at link time
>
>     LD    Rd,[IP,GOT[k]]     // get a pointer to the external variable
> and
>     CALX  [IP,GOT[k]]        // call external entry point
>
> But now that you have the above you can easily get::
>
>     CALX  [IP,Ri<<3,Table]   // call indexed method
>                              // can also be used for threaded JITs
>

These are unlikely to be particularly common cases *except* when using a
GOT or similar; if one does not use a GOT, this is less of an issue.


Granted, this does mean that if importing variables is supported, it
will come with a penalty. It is either this, or adding a mechanism where
one can use an absolute addressing mode and then fix up every instance
of the variable during program load.

Say:
MOV Abs64, R4
MOV.Q (R4), R8

Though, neither ELF nor PE/COFF has a mechanism for doing this.

Not currently a huge issue, as this would first require the ability to
import/export variables in DLLs.


>> RISC-V only has the former, but kinda shoots itself in the foot:
>>    GCC is good at eliminating most SP relative loads/stores;
>>    That means, the nominal percentage of indexed is even higher...
>
> A funny thing happens when you get rid of the "extra instructions"
> most IRSC ISAs cause you to have in your instruction stream::
> a) the number of instructions goes down
> b) you get rid of the easy instructions
> c) leaving all the complicated ones remaining
>

Possibly.
RISC-V is at a stage where execution is dominated by ALU ops;
BJX2 is at a stage where it is mostly dominated by memory Load/Store.

Being Ld/St bound seems like it would be worse, but part of this is
because it isn't burning quite so many ALU instructions on things like
address calculations.


Technically, part of the role had been moved over to LEA, but the LEA
ops are a bit further down the ranking.


>> As a result, the code is basically left doing excessive amounts of
>> shifts and adds, which (vs BJX2) effectively dethrone the memory
>> load/store ops for top-place.
>
> These are the easy instructions that are not necessary when ISA is
> properly conceived.
>

Yeah.


>> Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
>> also shoots itself in the foot. Because, not only has one hit the
>> limits of the ALU and LD/ST ops, there are no cheap fallbacks for
>> intermediate range constants.
>
> My 66000 has constants of all sizes for all instructions.
>

At present:
  BJX2: 9u for ALU and LD/ST ops, 10u/10s in XG2.
    Scaled 9u can give 2K/4K of reach for L/Q;
    the Disp10s might in retrospect have been better as 10u.
  RV64: 12s, unscaled, for LD/ST.
    This gives a slight advantage for ALU ops in RV64.
  BJX2:
    Can load Imm17s into R0-R31 in Baseline, R0..R63 in XG2;
    Can load Imm25s into R0.
  RV64:
    No single-op option larger than 12 bits;
    LUI and AUIPC don't really count here.

RV64 can encode a 32-bit constant in a 2-op sequence;
BJX2 can encode an arbitrary 33-bit immediate with a 64-bit encoding, or
a 64-bit constant in a 96-bit encoding.

RV64IMA has no way to encode a 64-bit constant in fewer than 6 ops.
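The 2-op sequence has a wrinkle worth spelling out: the second instruction's 12-bit immediate is sign-extended, so when bit 11 of the low part is set, the upper part must be bumped to compensate. A small Python model of that split (my own sketch, not compiler output):

```python
# Split a 32-bit constant into LUI-style upper and ADDI-style lower
# parts, handling the sign-extension carry.
def lui_addi_parts(c: int):
    lo = c & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000                    # ADDI will subtract; bump the upper part
    hi = (c - lo) & 0xFFFFFFFF          # value the LUI must materialize
    assert hi & 0xFFF == 0              # upper 20 bits only
    return hi, lo

def materialize(hi: int, lo: int) -> int:
    return (hi + lo) & 0xFFFFFFFF       # what the two-op sequence produces
```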

Seems like GCC's solution to a lot of this is "yeah, just use memory
loads for everything" (though still using 2-op sequences for PC-relative
address generation).


>> If my compiler, with its arguably poor optimizer and barely functional
>> register allocation, is beating GCC for performance (when targeting
>> RISC-V), I don't really consider this a win for some of RISC-V's
>> design choices.
>
> When you benchmark against a strawman, cows get to eat.
>

Yeah.

Would probably be a somewhat different situation against a similar
clocked ARMv8 core.


Though, some people were claiming that RISC-V can match ARMv8
performance?...

I would expect ARMv8 to beat RV64 for similar reasons to how BJX2 can
beat RV64, but with ARMv8 also having the advantage of a more capable
compiler.


Then again, I can note that generally BGBCC also uses stack canaries:
on function entry, it puts a magic number on the stack;
on function return, it reads the value and makes sure it is intact;
if not intact, it triggers a breakpoint.
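A hedged sketch of that scheme (Python standing in for the generated prolog/epilog code; the magic value is made up):

```python
# Plant a magic word on entry, verify it on return, trap if the body
# overflowed and clobbered it.
MAGIC = 0x5A5A5A5A

def call_with_canary(body, frame: list):
    frame.append(MAGIC)                 # prologue: plant the canary
    result = body(frame)                # body may scribble on the frame
    if frame.pop() != MAGIC:            # epilogue: verify before returning
        raise RuntimeError("canary corrupted: stack smashed")
    return result
```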



Well, also some boilerplate tasks:
Saving/reloading GBR, and going through a ritual to reload GBR as-needed
(say, in case the function is called from somewhere where GBR was set up
for a different program image);
Also uses an instruction that enables/disables WEX support in the CPU
based on the requested WEX profile;
...

There was also some amount of optional boilerplate (per function) to
facilitate exception unwinding (and the possibility of using try/catch
blocks). But, I am generally disabling this on the command-line
("-fnoexcept") as it is N/A for C. If enabled, every function needs this
boilerplate, or else it will not be possible to unwind through these
stack-frames on an exception.


These things eat a small amount of code space and clock cycles;
generally, GCC doesn't seem to do any of this.

I am guessing also maybe it has some other way to infer that it doesn't
need to have exception-unwinding for plain C programs?...




>> And, if GCC in its great wisdom, is mostly loading constants from
>> memory (having apparently offloaded most of them into the ".data"
>> section), this is also not a good sign.
>
> Loading constants:
> a) pollutes the data cache
> b) wastes energy
> c) wastes instructions
>

Yes.

But, I guess it does improve code density in this case... Because the
constants are "somewhere else" and thus don't contribute to the size of
'.text'; the program just puts a few kB worth of constants into '.data'
instead...

Does make the code density slightly less impressive.

Granted, one can argue the same of prolog/epilog compression in my case:
Save some space on prolog/epilog by calling or branching to prior
versions (since the code to save and restore GPRs is fairly repetitive).


>> Also, needing to use shift-pairs to sign and zero extend things is a
>> bit weak as well, ...
>
> See cows eat above.
>
>>
>
>> Also, as a random annoyance, RISC-V's instruction layout is very
>> difficult to decipher from a hexadecimal view. One basically needs to
>> dump it in binary to make it viable to mentally parse and lookup
>> instructions, which sucks.
>
> When you consume 3/4ths of the instruction space for 16-bit instructions;
> you create stress in other areas of ISA>

BJX2 Baseline originally burned 7/8 of the encoding space for 16-bit
ops.

For XG2, this space was reclaimed, generally for:
Expand register fields to 6-bits;
Expand Disp and Imm fields;
Imm9/Disp9 -> Imm10/Disp10 (3RI)
Imm10 -> Imm12 (2RI).
Expand BRA/BSR from 20 to 23 bits.
IOW: XG2 now has +/- 8MB for branch ops.
...


Bigger difference, I think, for mental decoding has to do with how the
bits were organized. Most things were organized around 4-bit nybbles,
and immediate fields are mostly contiguous, also aligned to 4-bit
nybbles. The result is generally that it is much easier to visually
match the opcode and extract the register fields.

With RISC-V, absent dumping the whole instruction in binary, this is
very difficult.


This was a bit painful when trying to debug Doom booting in RISC-V mode
in my Verilog core via "$display()" statements.

But, luckily, did at least eventually get it working.
So, at least to the limited extent of being able to boot directly into
Doom and similar, RISC-V mode does seem to be working...


BGB
Jan 24, 2024, 3:15:28 AM
Hmm, if I were to try to design something "new", optimizing for current
thoughts / observations.


Current leaning, say (flat register space):
64x 64-bit GPRs.
R0: ZR / PC(Ld/St)
R1: LR / TP(Ld/St)
R2: SP
R3: GP
R4 ..R7: Scratch, A0..A3
R8 ..R15: Preserve
R16..R19: Scratch
R16/R17, Return Value, addr for struct return.
R18: 'this'
R19: LR2 (prolog/epilog compression)
R20..R23: Scratch, A4..A7
R24..R31: Preserve
R32..R35: Scratch
R36..R39: Scratch, A8..A11 ?
R40..R47: Preserve
R48..R51: Scratch
R52..R55: Scratch, A12..A15 ?
R56..R63: Preserve

While not perfect, this would limit changes vs my existing ABI.

This was a roadblock towards trying to add RISC-V support to BGBCC, as
the ABI is different enough that it would require significant changes to
the backend.

ZR (Zero) and LR would be visible to "Normal" ops, but not as a
base-address for Load/Store, where they would be reinterpreted as PC and
TP (Task Pointer), where TP is assumed read-only for usermode programs.



With 32-bit instruction words, say:
ppzz-zzzz-zznn-nnnn-iiii-iiii-iiii-iiii // 2RI Imm16
ppzz-zzzz-zznn-nnnn-ssss-ssii-iiii-iiii // 3RI Imm10
ppzz-zzzz-zznn-nnnn-zzzz-iiii-iiii-iiii // 2RI Imm12
ppzz-zzzz-zznn-nnnn-ssss-sstt-tttt-zzzz // 3R (? 3RI Imm6)
ppzz-zzzz-zznn-nnnn-ssss-sszz-zzzz-zzzz // 2R
ppzz-zzzz-iiii-iiii-iiii-iiii-iiii-iiii // Imm24

pp:
00: PredT
01: PredF
10: Scalar
11: WEX
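Taking the sketched layouts at face value (the bit positions are my reading of the diagrams above, not a spec), decoding the pp field is a two-bit extract from the top of the word:

```python
# Hedged decoder for the proposed 2-bit 'pp' field, assuming it occupies
# bits 31..30 of each 32-bit instruction word.
PP_MEANING = {0b00: "PredT", 0b01: "PredF", 0b10: "Scalar", 0b11: "WEX"}

def pp_of(word: int) -> str:
    return PP_MEANING[(word >> 30) & 0b11]
```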

There would be a 'T' status bit, which would be designated exclusively
for predication.


pp00:
pp00-zzzz-zznn-nnnn-ssss-sstt-tttt-zzzz // 3R Space

pp00-0000-zznn-nnnn-ssss-sstt-tttt-0000 // LDS{B/W/L/Q} (Rs, Rt)
pp00-0000-zznn-nnnn-ssss-sstt-tttt-0001 // LDU{B/W/L/Q} (Rs, Rt)
pp00-0000-zznn-nnnn-ssss-sstt-tttt-0010 // ST{B/W/L/Q} (Rs, Rt)
pp00-0000-zznn-nnnn-ssss-sstt-tttt-0011 // LEA{B/W/L/Q} (Rs, Rt)

Note that Rs would always be the base register for load/store; for
stores, Rn would serve as the value source, and for loads, as the
destination. Here, Rt would serve as an index, scaled by the access
size.

pp00-0001-zznn-nnnn-ssss-sstt-tttt-000z // ALU
ADD/SUB/SHAD/SHLD/MUL/AND/OR/XOR
pp00-0001-zznn-nnnn-ssss-sstt-tttt-001z // ALU
ADDSL/SUBSL/MULSL/SHADL/ADDUL/SUBUL/MULUL/SHLDL

pp01:
pp01-zzzz-zznn-nnnn-ssss-ssii-iiii-iiii // LD/ST Disp10
(Will likely assume scaled zero-extended LD/ST displacements)

Could maybe provide Imm6n LD/ST ops for a limited range of negative
displacements, but negative displacements are rarely used in general
(and typically much smaller than positive displacements).


pp10:
pp10-zzzz-zznn-nnnn-ssss-ssii-iiii-iiii // ALU Imm10
(Most ALU ops will have zero-extended immediate values)

pp10-000z-zznn-nnnn-ssss-ssii-iiii-iiii // ALU Rs, Imm10, Rn
ADD/SUB/SHAD/SHLD/MUL/AND/OR/XOR

pp10-001z-zznn-nnnn-ssss-ssii-iiii-iiii // ALU Rs, Imm10, Rn
ADDSL/SUBSL/MULSL/-/ADDUL/SUBUL/MULUL/-


pp11-0zzz:
pp11-0zzz-zznn-nnnn-zzzz-iiii-iiii-iiii // 2RI Imm12
(Some ops may be effectively Imm13s)

pp11-10zz:
pp11-10zz-zznn-nnnn-ssss-sszz-zzzz-zzzz // 2R

pp11-110z:
pp11-1100-00nn-nnnn-iiii-iiii-iiii-iiii // LI Imm16u, Rn //0..65535
pp11-1100-01nn-nnnn-iiii-iiii-iiii-iiii // LI Imm16n, Rn //-65536..-1
pp11-1100-10nn-nnnn-iiii-iiii-iiii-iiii // ADD Imm16u, Rn
pp11-1100-11nn-nnnn-iiii-iiii-iiii-iiii // ADD Imm16n, Rn
pp11-1101-00nn-nnnn-iiii-iiii-iiii-iiii // FLDCH Imm16u, Rn //Fp16
pp11-1101-01nn-nnnn-iiii-iiii-iiii-iiii // ? LEA.Q (GP, Imm16u), Rn
pp11-1101-10nn-nnnn-iiii-iiii-iiii-iiii // -
pp11-1101-11nn-nnnn-iiii-iiii-iiii-iiii // -

pp11-111z:
0011-1110-iiii-iiii-iiii-iiii-iiii-iiii // BT Disp24
0111-1110-iiii-iiii-iiii-iiii-iiii-iiii // BF Disp24
1011-1110-iiii-iiii-iiii-iiii-iiii-iiii // BRA Disp24
1111-1110-iiii-iiii-iiii-iiii-iiii-iiii // Jumbo-Imm

0011-1111-iiii-iiii-iiii-iiii-iiii-iiii // -
0111-1111-iiii-iiii-iiii-iiii-iiii-iiii // -
1011-1111-iiii-iiii-iiii-iiii-iiii-iiii // BSR Disp24
1111-1111-iiii-iiii-iiii-iiii-iiii-iiii // Jumbo-Op


This would sacrifice a few cases that exist in BJX2 but have mostly
fallen into disuse as a consequence of the existence of Jumbo prefixes.

Here, one can assume that Jumbo prefixes will exist, so the relative
loss of not having a dedicated "load 24 bits into a fixed register" case
is less.


Would also assume jumbo prefixes will deal with things like loading
function pointer addresses, etc.


...




Scott Lurndal
Jan 24, 2024, 9:45:39 AM
Although a strawman can be made from hay or leaves and twigs, or any
other stuffing, straw, as a waste product from grain production,
is traditional.

MitchAlsup1

unread,
Jan 24, 2024, 3:25:47 PMJan 24
to
BGB wrote:

> On 1/23/2024 4:10 PM, MitchAlsup1 wrote:
>>
>>> Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
>>> also shoots itself in the foot. Because, not only has one hit the
>>> limits of the ALU and LD/ST ops, there are no cheap fallbacks for
>>> intermediate range constants.
>>
>> My 66000 has constants of all sizes for all instructions.
>>
------------------------
>>> And, if GCC in its great wisdom, is mostly loading constants from
>>> memory (having apparently offloaded most of them into the ".data"
>>> section), this is also not a good sign.
>>
>> Loading constants:
>> a) pollutes the data cache
>> b) wastes energy
>> c) wastes instructions
>>

> Yes.

> But, I guess it does improve code density in this case... Because the
> constants are "somewhere else" and thus don't contribute to the size of
> '.text'; the program just puts a few kB worth of constants into '.data'
> instead...

Consider the store of a constant to a constant address::

array[7] = bigFPconstant;

RISC-V
text
auipc Ra,high(&bigFPconstant)
ldd Rd,[Ra+low(&bigFPconstant)]
auipc Ra,high(&array+48)
std Rd,[Ra+low(&array+48)]
data
double bigFPconstant

4 instructions 6 words of memory 2 registers

My 66000:
STD #bigFPconstant,[IP,,&array+48]

1 instruction 4 words of memory all in .text 0 registers

Also note: RISC-V has no real way to support 64-bit displacements other
than resorting to LDs of pointers (ala GOT and similar).

> Does make the code density slightly less impressive.

> Granted, one can argue the same of prolog/epilog compression in my case:
> Save some space on prolog/epilog by calling or branching to prior
> versions (since the code to save and restore GPRs is fairly repetitive).

ENTER and EXIT eliminate the additional control transfers and can allow
FETCH of the return address to start before the restores are finished.

BGB

unread,
Jan 25, 2024, 2:06:28 AMJan 25
to
This scenario would be two instructions in my case.

I suspect the situation isn't quite *that* bad for RISC-V, mostly
because from the instruction dumps, it looks like it lumps constants
together into tables and then loads them from the table, able to use a
shared based register (and, in some cases, GP).

Say:
GP is initialized to 2K past the start of '.data';
Seems to cluster common constants at negative addresses relative to GP,
common local variables at positive addresses.

Then seemingly falls back to AUIPC+LD/ST outside of +/- 2K, with other
constants being held in tables (maybe GOT, hard to really tell from
disassembly, or looking at the back-track in machine-code).

Or, at least, this is how it seemed to work when debugging stuff.


But, yeah, looks like, besides adding indexed load/store to a
"wishlist", something like a 17-bit constant load would also be a high
priority.


From my own possible extension list:

* 00110ss-ooooo-mmmmm-ttt-nnnnn-01-01111 Lt Rn, (Rm, Ro)
** 00110ss-ttttt-mmmmm-000-nnnnn-01-01111 ? LB Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-001-nnnnn-01-01111 ? LH Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-010-nnnnn-01-01111 ? LW Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-011-nnnnn-01-01111 ? LD Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-100-nnnnn-01-01111 ? LBU Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-101-nnnnn-01-01111 ? LHU Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-110-nnnnn-01-01111 ? LWU Rn, (Rm, Rt*Sc)
** 00110ss-ttttt-mmmmm-111-nnnnn-01-01111 ? LX Rn, (Rm, Rt*Sc)

* 00111ss-ooooo-mmmmm-ttt-nnnnn-01-01111 St (Rm, Ro), Rn
** 00111ss-ttttt-mmmmm-000-nnnnn-01-01111 ? SB (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-001-nnnnn-01-01111 ? SH (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-010-nnnnn-01-01111 ? SW (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-011-nnnnn-01-01111 ? SD (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-100-nnnnn-01-01111 ? SBU (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-101-nnnnn-01-01111 ? SHU (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-110-nnnnn-01-01111 ? SWU (Rm, Rt*Sc), Rn
** 00111ss-ttttt-mmmmm-111-nnnnn-01-01111 ? SX (Rm, Rt*Sc), Rn

Shoved into a hole in the AMO space.


* iiiiiii-iiiii-iiiii-111-nnnnn-00-11011 ? LI Rn, Imm17s
// In the space ANDIW would have existed in, if it existed.

Granted, the harder part here is confirming which encodings may or may
not be in use, as there doesn't seem to be any public opcode list or
registry.


> Also note: RISC-V has no real way to support 64-bit displacements other
> than resorting to LDs of pointers (ala GOT and similar).
>

Yeah.
It has ended up in a situation where GOT is seemingly the "best" option.


>> Does make the code density slightly less impressive.
>
>> Granted, one can argue the same of prolog/epilog compression in my case:
>> Save some space on prolog/epilog by calling or branching to prior
>> versions (since the code to save and restore GPRs is fairly repetitive).
>
> ENTER and EXIT eliminate the additional control transfers and can allow
> FETCH of the return address to start before the restores are finished.

Possible, but branches are cheaper to implement in hardware, and would
have been implemented already...


Granted, it is a similar thing to the recent addition of a memcpy()
slide for intermediate-sized memcpy.

Where, if one expresses the slide in reverse order, copying any multiple
of N bytes can be expressed as a branch into the slide (with less
overhead than a loop).


But, I guess in theory, the memcpy slide could be implemented in plain C
with a switch.
uint64_t *dst, *src;
uint64_t li0, li1, li2, li3;
... copy final bytes ...
switch(sz>>5)
{
...
case 2:
li0=src[4]; li1=src[5];
li2=src[6]; li3=src[7];
dst[4]=li0; dst[5]=li1;
dst[6]=li2; dst[7]=li3;
case 1:
li0=src[0]; li1=src[1];
li2=src[2]; li3=src[3];
dst[0]=li0; dst[1]=li1;
dst[2]=li2; dst[3]=li3;
case 0:
break;
}

Like, in theory one could have a special hardware feature, but a plain
software solution is reasonably effective.


MitchAlsup1

unread,
Jan 25, 2024, 12:33:21 PMJan 25
to
BGB wrote:

> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> Granted, one can argue the same of prolog/epilog compression in my case:
>>> Save some space on prolog/epilog by calling or branching to prior
>>> versions (since the code to save and restore GPRs is fairly repetitive).
>>
>> ENTER and EXIT eliminate the additional control transfers and can allow
>> FETCH of the return address to start before the restores are finished.

> Possible, but branches are cheaper to implement in hardware, and would
> have been implemented already...

Are you intentionally misreading what I wrote ??

Epilogue is a sequence of loads leading to a jump to the return address.

Your ISA cannot jump to the return address while performing the loads
so FETCH does not get the return address and can't start fetching
instructions until the jump is performed.

Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
the return address from the stack and fetch the instructions at the
return address while still loading the preserved registers (that were
saved) so that the instructions are ready for execution by the time
the last LD is performed.

In addition, If one is performing an EXIT and fetch runs into a CALL;
it can fetch the Called address and if there is an ENTER instruction
there, it can cancel the remainder of EXIT and cancel some of ENTER
because the preserved registers are already on the stack where they are
supposed to be.

Doing these with STs and LDs cannot save those cycles.

> Granted, it is a similar thing to the recent addition of a memcpy()
> slide for intermediate-sized memcpy.

> Where, if one expresses the slide in reverse order, copying any multiple
> of N bytes can be expressed as a branch into the slide (with less
> overhead than a loop).


> But, I guess in theory, the memcpy slide could be implemented in plain C
> with a switch.
> uint64_t *dst, *src;
> uint64_t li0, li1, li2, li3;
> ... copy final bytes ...
> switch(sz>>5)
> {
> ...
> case 2:
> li0=src[4]; li1=src[5];
> li2=src[6]; li3=src[7];
> dst[4]=li0; dst[5]=li1;
> dst[6]=li2; dst[7]=li3;
> case 1:
> li0=src[0]; li1=src[1];
> li2=src[2]; li3=src[3];
> dst[0]=li0; dst[1]=li1;
> dst[2]=li2; dst[3]=li3;
> case 0:
> break;
> }

Looks like Duff's device.

But why not just::

MM Rto,Rfrom,Rcount

BGB

unread,
Jan 25, 2024, 2:12:20 PMJan 25
to
On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> Granted, one can argue the same of prolog/epilog compression in my
>>>> case:
>>>> Save some space on prolog/epilog by calling or branching to prior
>>>> versions (since the code to save and restore GPRs is fairly
>>>> repetitive).
>>>
>>> ENTER and EXIT eliminate the additional control transfers and can allow
>>> FETCH of the return address to start before the restores are finished.
>
>> Possible, but branches are cheaper to implement in hardware, and would
>> have been implemented already...
>
> Are you intentionally misreading what I wrote ??
>

?? I don't understand.



> Epilogue is a sequence of loads leading to a jump to the return address.
>
> Your ISA cannot jump to the return address while performing the loads
> so FETCH does not get the return address and can't start fetching
> instructions until the jump is performed.
>

You can put the load for the return address before the other loads.
Then, if the epilog is long enough (so that this load is no longer in
flight once it hits the final jump), the branch predictor can start
fetching the post-return instructions before the jump is reached.

This is likely a non-issue as I see it.


It is only really an issue if one demands that reloading the return
address be done as one of the final instructions in the epilog, and not
one of the first instructions.


Granted, one would have to do it as one of the final ops, if it were
implemented as a slide, but it is not. There are "practical reasons" why
a slide would not be a workable strategy in this case.

So, generally, these parts of the prolog/epilog sequences are emitted
for every combination of saved/restored registers that had been encountered.

Though, granted, when used, does mean that any such function needs to
effectively do two sets of stack-pointer adjustments:
One set for the save/restore area (in the reused part);
One part for the function (for its data and local/temporary variables
and similar).


> Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
> the return address from the stack and fetch the instructions at the
> return address while still loading the preserved registers (that were
> saved) so that the instructions are ready for execution by the time
> the last LD is performed.
>
> In addition, If one is performing an EXIT and fetch runs into a CALL;
> it can fetch the Called address and if there is an ENTER instruction
> there, it can cancel the remainder of EXIT and cancel some of ENTER
> because the preserved registers are already on the stack where they are
> supposed to be.
>
> Doing these with STs and LDs cannot save those cycles.
>

I don't see why not, the branch-predictor can still do its thing
regardless of whether or not LD/ST ops were used.

And, having the instructions in the pipeline a few cycles earlier will
buy nothing if they still can't execute until after the data is reloaded.


Similarly, can't go much wider than the existing 128-bit loads/stores
without adding more register ports, so...


The main thing something like ENTER/EXIT could save would be some code
space.
> Looks like Duff's device.

Kinda, but generally without the loop and egregious abuse of C syntax.

Would get kinda bulky to express 1K or so worth of memory copy as a big
"switch()" block.

For anything past a certain size limit, will need to use a loop though.


> But why not just::
>
>       MM    Rto,Rfrom,Rcount
>

Would need special hardware support for this (namely, hardware to fake a
series of loads/stores in the pipeline).

Potentially burning a few K of code-space for a big copy-slide is at
least a reasonable tradeoff in that no special hardware facilities are
needed.

Partly, as there needs to be two sets of copy-slides:
One that deals with aligned copy;
One that can deal with unaligned copy.


Though, generally not used for size-optimized binaries, since here size
is the priority (and always using a loop-based generic copy, is smaller).

Chris M. Thomasson

unread,
Jan 25, 2024, 2:21:36 PMJan 25
to
Indeed. Fwiw, I will never forget when I overheard a farmer talk about
how some of his fence lines were infested with Jimsonweed.

MitchAlsup1

unread,
Jan 25, 2024, 4:26:01 PMJan 25
to
BGB wrote:

> On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
>>>> BGB wrote:
>>>>
>>>>> Granted, one can argue the same of prolog/epilog compression in my
>>>>> case:
>>>>> Save some space on prolog/epilog by calling or branching to prior
>>>>> versions (since the code to save and restore GPRs is fairly
>>>>> repetitive).
>>>>
>>>> ENTER and EXIT eliminate the additional control transfers and can allow
>>>> FETCH of the return address to start before the restores are finished.
>>
>>> Possible, but branches are cheaper to implement in hardware, and would
>>> have been implemented already...
>>
>> Are you intentionally misreading what I wrote ??
>>

> ?? I don't understand.



>> Epilogue is a sequence of loads leading to a jump to the return address.
>>
>> Your ISA cannot jump to the return address while performing the loads
>> so FETCH does not get the return address and can't start fetching
>> instructions until the jump is performed.
>>

> You can put the load for the return address before the other loads.
> Then, if the epilog is long enough (so that this load is no-longer in
> flight once it hits the final jump), the branch-predictor will lead to
> it start loading the post-return instructions before the jump is reached.

Yes, you can read RA early.
What you cannot do is JMP early so the FETCH stage fetches instructions
at return address early.
{{If you JMP early, then the rest of the LDs won't happen}}

> This is likely a non-issue as I see it.

> It is only really an issue if one demands that reloading the return
> address be done as one of the final instructions in the epilog, and not
> one of the first instructions.

I make no such demand--I merely demand the JMP RA is the last instruction.

> Granted, one would have to do it as one of the final ops, if it were
> implemented as a slide, but it is not. There are "practical reasons" why
> a slide would not be a workable strategy in this case.

> So, generally, these parts of the prolog/epilog sequences are emitted
> for every combination of saved/restored registers that had been encountered.

> Though, granted, when used, does mean that any such function needs to
> effectively do two sets of stack-pointer adjustments:
> One set for the save/restore area (in the reused part);
> One part for the function (for its data and local/temporary variables
> and similar).


>> Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
>> the return address from the stack and fetch the instructions at the
>> return address while still loading the preserved registers (that were
>> saved) so that the instructions are ready for execution by the time
>> the last LD is performed.
>>
>> In addition, If one is performing an EXIT and fetch runs into a CALL;
>> it can fetch the Called address and if there is an ENTER instruction
>> there, it can cancel the remainder of EXIT and cancel some of ENTER
>> because the preserved registers are already on the stack where they are
>> supposed to be.
>>
>> Doing these with STs and LDs cannot save those cycles.
>>

> I don't see why not, the branch-predictor can still do its thing
> regardless of whether or not LD/ST ops were used.

Consider::

main:
...
CALL funct1
CALL funct2

funct2:
SUB SP,SP,stackArea2
ST R0,[SP,offset20]
ST R30,[SP,offset230]
ST R29,[SP,offset229]
ST R28,[SP,offset228]
ST R27,[SP,offset227]
ST R26,[SP,offset226]
ST R25,[SP,offset225]
...

funct1:
...
LD R0,[SP,offset10]
LD R30,[SP,offset130]
LD R29,[SP,offset129]
LD R28,[SP,offset128]
LD R27,[SP,offset127]
LD R26,[SP,offset126]
LD R25,[SP,offset125]
LD R24,[SP,offset124]
LD R23,[SP,offset123]
LD R22,[SP,offset122]
LD R21,[SP,offset121]
ADD SP,SP,stackArea1
JMP R0

The above would have to observe that all offset1's are equal to all
offset2's in order to short circuit the data movements. A single::

LD R26,[SP,someotheroffset]

ruins the short circuit.

Whereas:

funct2:
ENTER R25,R0,stackArea2
...

funct1:
...
EXIT R21,R0,stackArea1

will have registers R0,R25..R30 in the same positions on the stack
guaranteed by ISA definition!!

BGB

unread,
Jan 25, 2024, 9:09:54 PMJan 25
to
In my case, both LR and R1 are forwarded to the branch-predictor via
side-channels, so the values are visible as soon as they cross the WB stage.

Once this happens, they can be predicted in the same way as normal
constant-displacement branches (IOW: it can see through the "RTS" or
"JMP R1" instruction).


This is N/A if using a different register.
In RV64 Mode, LR is mapped to X1 and R1/DHR to X5.

So, theoretically the same optimization can be used for RV64, though at
the moment, the branch predictor doesn't yet match RV instructions.


Note that this does not affect performance estimates via my emulator,
which had assumed the RV branches would be branch predicted (though, in
the Verilog core, at present the actual RV code will run slower than the
emulator predicts...).



As I look at Doom running in the Verilog simulation and can observe that
for RISC-V at the moment it is running at roughly 8-11 fps...
Well, and with a lot of sprites going on, 5 fps.

So, I have RV64 running in the Verilog simulation, but it appears to be
performing a bit worse than my emulator predicts.

TBD how much has to do with a current lack of RV support in the branch
predictor.
OK.

This would be something other than a branch-predictor concern.

In the return case, all the branch predictor cares about is whether LR
or R1 is still in-flight at the moment the "RTS" or "JMP R1" is
encountered, but need not pattern-match the Loads/Stores to get there.

MOV.X instruction also saves/restores 2 registers, doesn't care about
what happens with the values, or how it relates to other instructions.

I guess, If one wanted to pattern match two MOV.X's, say, into a "MOV.Y"
(hypothetical 256-bit LD/ST), one would care that the offsets and
registers pair up.


This isn't currently done (since I don't have the register ports for this).

I guess it could be possible to detect this case for MOV.Q pairs and
effectively merge them into a MOV.X operation. Similar for LD pairs.

But, pattern matching instructions (AKA: "fusion") won't be cheap
either. For now though, I will ignore this possibility.



Quadibloc

unread,
Jan 26, 2024, 3:21:08 AMJan 26
to
On Tue, 23 Jan 2024 09:50:29 +0000, Quadibloc wrote:

> I have indeed decided that using three base registers for the
> basic load-store instructions is much preferable to shortening the
> length of the displacement even by one bit.

Another change has been made to Concertina III, based on the work
done for Concertina IV. The instruction prefix has been eliminated
as a possible meaning of the header word; instead, instruction
predication can be specified by the header.

John Savard

Robert Finch

unread,
Jan 26, 2024, 11:58:29 AMJan 26
to
I like the ENTER / EXIT instructions and safe stack idea, and have
incorporated them into Q+ as ENTER and LEAVE. EXIT makes me think of
program exit(). They can improve code density. I gather that the stack
used for ENTER and EXIT is not the same stack as is available for the
rest of the app. This means managing two stack pointers, the regular
stack and the safe stack. Q+ could have the safe stack pointer as a
register that is not even accessible by the app and not part of the GPR
file.

For ENTER/LEAVE Q+ has the number of registers to save specified as a
four-bit number and saves only the saved registers, link register and
frame pointer according to the ABI. So, “ENTER 3,64” will save s0 to s2,
the frame-pointer, link register and allocate 64 bytes plus the return
block on the stack. The return block contains the frame-pointer, link
register and two slots that are zeroed out intended for exception
handlers. The saved registers are limited to s0 through s9.

Q+ also has PUSHA / POPA instructions to push or pop all the
registers, meant for interrupt handlers. PUSH and POP instructions by
themselves can push or pop up to five registers.

Some thought has been given towards modifying ENTER and LEAVE to support
interrupt handlers, rather than have separate PUSHA / POPA instructions.
ENTER 15,0 would save all the registers, and LEAVE 15,0 would restore
them all and return using an interrupt return.

MitchAlsup1

unread,
Jan 26, 2024, 4:35:49 PMJan 26
to
Robert Finch wrote:

> On 2024-01-25 4:25 p.m., MitchAlsup1 wrote:
>>
>>
>> Whereas:
>>
>> funct2:
>>      ENTER   R25,R0,stackArea2
>>      ...
>>
>> funct1:
>>      ...
>>      EXIT    R21,R0,stackArea1
>>
>> will have registers R0,R25..R30 in the same positions on the stack
>> guaranteed by ISA definition!!

> I like the ENTER / EXIT instructions and safe stack idea, and have
> incorporated them into Q+ called ENTER and LEAVE. EXIT makes me think of
> program exit(). They can improve code density. I gather that the stack
> used for ENTER and EXIT is not the same stack as is available for the
> rest of the app. This means managing two stack pointers, the regular
> stack and the safe stack. Q+ could have the safe stack pointer as a
> register that is not even accessible by the app and not part of the GPR
> file.

LEAVE has older x86 connotations, so I used a different word.

Registers R16..R31 go on the safe stack (when enabled) SSP
Registers R01..R15 go on the regular stack SP

When safe stack is enabled, Return Address goes directly on safe stack
without passing through R0; and comes off of safe-stack without passing
through R0.

SSP requires privilege to access.
The safe stack pages are required to have RWE = 3'B000 rights; so SW
cannot read or write these containers directly or indirectly.

> For ENTER/LEAVE Q+ has the number of registers to save specified as a
> four-bit number and saves only the saved registers, link register and
> frame pointer according to the ABI. So, “ENTER 3,64” will save s0 to s2,
> the frame-pointer, link register and allocate 64 bytes plus the return
> block on the stack. The return block contains the frame-pointer, link
> register and two slots that are zeroed out intended for exception
> handlers. The saved registers are limited to s0 so s9.

I specify start and stop registers in ENTER and EXIT. In addition the
16-bit immediate field is used to allocate/deallocate space other than
the save/restored registers. Since the stack is always doubleword
aligned, the low order 3 bits are used "for special things"::
bit<0> decides if SP is saved on the stack (or not 99%)
bit<1> decides if FP is saved and updated (or restored)
bit<2> decides if a return is performed (used when SW walks a stack
back when doing try-throw-catch stuff.)

I use the HoB of the register index to select which stack pointer is used.

> Q+ also has a PUSHA / POPA instructions to push or pop all the
> registers, meant for interrupt handlers. PUSH and POP instructions by
> themselves can push or pop up to five registers.

By the time control arrives at interrupt dispatch, the old registers
have been saved and the registers of the ISR have been loaded; so have
ASID and ROOT,..... Thus an ISR can keep pointers in its register file
to quicken access when invoked.

BGB

unread,
Jan 26, 2024, 11:15:06 PMJan 26
to
Admittedly, it can make sense for an ISA intended for higher-end
hardware, but not necessarily something intended to aim for similar
hardware costs to something like an in-order RISC-V core.

In my case, the core seems to be within a similar LUT cost range to some
of the RISC-V soft-cores. Generally smaller than some of the superscalar
cores, but bigger than a lot of the in-order scalar cores.


Looks like, if one wants to optimize for ASIC though (vs FPGA), it makes
sense to minimize the use of SRAM.

So, say:
Multiple copies of the regfile (like RISC-V does) is still not ideal;
Might also make sense to try to optimize things for smaller caches,
possibly with more expensive logic (so, say, small set-associative
caches rather than bigger direct-mapped caches).


Seems like a ringbus might still be cheapish though, since its
storage is mostly in the flip-flops used to implement the ring itself,
rather than needing SRAM FIFOs like in some other bus designs. I would
suspect that something like AXI or Wishbone would likely involve a
number of internal FIFO buffers.

Also, unlike my original bus design, it is not dead slow...


Looking at it, it seems "Wishbone Classic" is functionally similar, but
has different signaling. Whereas other versions of Wishbone would likely
need FIFOs to hold requests to be pushed around the bus.

Though, for whatever reason, they were going with 32 or 64-bit
transfers, whereas my bus was designed around sending data in 128-bit
chunks. Granted, potentially, passing 128-bits would cost more than
64-bits. However, I would expect the logic costs to deal with 64-bit
transfers might be higher than for 128-bit transfers (say, since now the
L1 caches would need to deal with multi-part transfers for each L1 cache
line; and the L2 would need to deal with its cache lines being accessed
in terms of a larger number of comparable smaller pieces).

You also wouldn't want 64-bit cache lines as then the tagging will cost
more than the payload data (whereas 128 and 256 bit have a better ratio
of tagging vs payload).


I would guess though, possibly for an ASIC, using 32B or 64B cache lines
might be preferable, as here a smaller amount of the total SRAM is spent
on tagging bits, and in relation the logic would be cheaper relative to
the cost of the SRAM.


...

Robert Finch

unread,
Jan 27, 2024, 1:47:34 AMJan 27
to
Once there is micro-code or a state machine to handle an instruction
with multiple micro-ops, it is not that costly to add other operations.
The Q+ micro-code cost something like < 1k LUTs. Many early micros used
micro-code.
Q+ uses a 128-bit system bus; the bus tag is not the same tag as used
for the cache. Q+ burst loads the cache with 4 128-bit accesses for 512
bits, and the 64B cache line is tagged with a single tag. The
instruction / data cache controller takes care of adjusting the bus
size between the cache and system.

I think I suggested this before, and the idea got shot down, but I
cannot find the post. The idea is "mystery operations", where the opcode
comes from a register value. I was thinking of adding an instruction
modifier to do this. The instruction modifier would supply the opcode bits for
the next instruction from a register value. This would only be applied
to specific classes of instructions. In particular register-register
operate instructions. Many of the register-register functions are not
decoded until execute time. The function code is simply copied to the
execution unit. It does not have to run through the decode and rename
stage. I think this field could easily come from a register. Seems like
it would be easy to update the opcode while the instruction is sitting
in the reorder buffer.

MitchAlsup1

unread,
Jan 27, 2024, 12:30:46 PMJan 27
to
Robert Finch wrote:

> On 2024-01-26 11:10 p.m., BGB wrote:
>> On 1/26/2024 10:58 AM, Robert Finch wrote:
>>><snip>
>>
>> Admittedly, it can make sense for an ISA intended for higher-end
>> hardware, but not necessarily something intended to aim for similar
>> hardware costs to something like an in-order RISC-V core.

> Once there is micro-code or a state machine to handle an instruction
> with multiple micro-ops, it is not that costly to add other operations.
> The Q+ micro-code cost something like < 1k LUTs. Many early micro's use
> micro-code.

The FMAC unit has a sequencer that performs FDIV, SQRT, and transcendental
polynomials. The memory unit has a sequencer to perform LDM, STM, MM, and
ENTER and EXIT.

>> <snip>
>>
> Q+ uses a 128-bit system bus the bus tag is not the same tag as used for
> the cache. Q+ burst loads the cache with 4 128-bit accesses for 512 bits
> and the 64B cache line is tagged with a single tag. The instruction /
> data cache controller takes care of adjusting the bus size between the
> cache and system.

A four (4) Beat burst is de rigueur for FPGA implementations.

> I think I suggested this before, and the idea got shot down, but I
> cannot find the post. It is mystery operations where the opcode comes
> from a register value. I was thinking of adding an instruction modifier
> to do this. The instruction modifier would supply the opcode bits for
> the next instruction from a register value. This would only be applied
> to specific classes of instructions. In particular register-register
> operate instructions. Many of the register-register functions are not
> decoded until execute time. The function code is simply copied to the
> execution unit. It does not have to run through the decode and rename
> stage. I think this field could easily come from a register. Seems like
> it would be easy to update the opcode while the instruction is sitting
> in the reorder buffer.

Classic 360 EXECUTE instruction ??
Basically, it sounds dangerous. {Side channels in plenty}

BGB

unread,
Jan 27, 2024, 3:46:29 PMJan 27
to
On 1/27/2024 11:25 AM, MitchAlsup1 wrote:
> Robert Finch wrote:
>
>> On 2024-01-26 11:10 p.m., BGB wrote:
>>> On 1/26/2024 10:58 AM, Robert Finch wrote:
>>>> <snip>
>>>
>>> Admittedly, it can make sense for an ISA intended for higher-end
>>> hardware, but not necessarily something intended to aim for similar
>>> hardware costs to something like an in-order RISC-V core.
>
>> Once there is micro-code or a state machine to handle an instruction
>> with multiple micro-ops, it is not that costly to add other
>> operations. The Q+ micro-code cost something like < 1k LUTs. Many
>> early micro's use micro-code.
>
> The FMAC unit has a sequencer that performs FDIV, SQRT, and transcendental
> polynomials. The memory unit has a sequencer to perform LDM, STM, MM, and
> ENTER and EXIT.
>

I had a mechanism that basically plugged the outputs of the MUL and ADD
units together in a certain way to perform FDIV and FSQRT via running in
a feedback loop (and would wait a certain number of clock-cycles for the
result to converge). Not particularly fast though, and the results were
debatable...

For FDIV, was faster and more accurate to route it through the Shift-Add
unit.


But, yeah, no microcode or sequencers or similar thus far in my case.
All instructions have needed to map directly to some behavior in the EX
stage.

The closest thing is a mechanism within the main FPU to support SIMD
operations, which is basically logic to MUX the inputs and outputs to
the FPU based on the current clock-cycle (then, say, rather than
stalling for 6 cycles for a scalar operation, one stalls for 10 cycles
for a SIMD operation). In this case, if the faster SIMD unit exists, the
SIMD instructions are mapped to that unit instead (which does 4 Binary16
or Binary32 ops in parallel).
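Using the stall counts quoted above (6 cycles for a scalar op, 10 cycles for a 4-wide time-MUXed SIMD op), the throughput trade can be sketched as follows; the figures are just the ones from this post, and the scalar case assumes each op fully stalls:

```python
def fpu_stall_cycles(n_elems, use_simd_mux):
    # Per the post: a scalar FPU op stalls 6 cycles; the time-MUXed
    # SIMD path stalls 10 cycles for each group of 4 elements.
    if use_simd_mux:
        return 10 * ((n_elems + 3) // 4)
    return 6 * n_elems
```

So 4 elements cost 10 cycles via the MUXed path versus 24 as back-to-back stalling scalar ops.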


And, anything that can't be done directly in the EX stages hasn't been
done at all.


>>> <snip>
>>>
>> Q+ uses a 128-bit system bus the bus tag is not the same tag as used
>> for the cache. Q+ burst loads the cache with 4 128-bit accesses for
>> 512 bits and the 64B cache line is tagged with a single tag. The
>> instruction / data cache controller takes care of adjusting the bus
>> size between the cache and system.
>
> A four (4) Beat burst is de rigueur for FPGA implementations.
>

In my case, it is 1 beat per L1 line (128 bits), effectively sending the
whole line at once.

If I were to do 32B cache lines, this would require two beats. It would
also complicate the logic in the L1 cache.
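The beat counts follow directly from line size over bus width; a trivial sketch, using the 128-bit bus from above:

```python
def beats_per_line(line_bytes, bus_bits=128):
    # Number of bus beats needed to transfer one cache line.
    return (line_bytes * 8 + bus_bits - 1) // bus_bits
```

This gives 1 beat for 16B lines, 2 for 32B, and the 4-beat burst for 64B lines mentioned earlier in the thread.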


For an ASIC, it would likely be preferable to use 32B or 64B lines,
since logic is comparably cheaper there, and it would be harder to
justify roughly half the SRAM use going just to the tag bits.
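The SRAM-overhead point can be made concrete with a quick calculation; the per-line metadata width below is an illustrative assumption, chosen to match the "about half" figure for 16B lines:

```python
def tag_fraction(line_bytes, meta_bits=128):
    # Fraction of L1 SRAM spent on tag/metadata bits rather than data,
    # assuming (illustratively) 128 bits of tag+flags per line.
    data_bits = line_bytes * 8
    return meta_bits / (meta_bits + data_bits)
```

Under that assumption, 16B lines spend 50% of the SRAM on metadata; 32B lines drop that to 33%, and 64B lines to 20%.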


One could in theory use 1 row of 32B lines rather than two rows of 16B
lines, but there would be a problem in terms of memory ports (3-port
2R1W BRAM/SRAM, as opposed to 1RW or 1R1W, isn't a thing IIRC).

Generally, one needs a way to do two accesses in parallel to support
unaligned memory access (only 1 row, and a single access, would be
needed if the cache only supported aligned access).
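Why two parallel row accesses are needed: an unaligned access can straddle a row boundary, as a quick sketch shows (16B rows assumed, matching the line arrangement above):

```python
def rows_touched(addr, size_bytes, row_bytes=16):
    # Number of cache rows spanned by an access [addr, addr + size).
    first = addr // row_bytes
    last = (addr + size_bytes - 1) // row_bytes
    return last - first + 1
```

An aligned access always touches 1 row; e.g. an 8-byte load at address 12 touches 2 rows, which is why the cache needs the two accesses in parallel.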



I guess one option could be to consider something like Wishbone B3 or
B4. While it would likely involve FIFOs, it could have lower latency
than is possible with my ringbus design.

And, in this case, performance is more being limited by latency than by
bus capacity.


Or, an intermediate option could be to keep the existing bus signaling,
but merely replace much of the "hot path" parts of the ring with a sort
of "crossbar". This option wouldn't necessarily need any FIFOs to be
added. Though, if any of the endpoints become congested, it could
potentially deadlock the bus (say, the L1 D$ gets backlogged with
requests and the L2 cache with responses, and with no free spots in
either micro-ring, forward progress becomes impossible).

Comparably, the existing strategy of reducing ring latency via
special-case paths is moderately effective (IOW: overall topology is
still a ring, but with the equivalent of on and off ramps for messages
to take different/shorter paths to their intended destination).
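The effect of the on/off ramps can be modeled as a shortcut edge on a unidirectional ring; the node count and ramp placement below are arbitrary, purely for illustration:

```python
def ring_hops(src, dst, n):
    # Hops from src to dst on a unidirectional ring of n stops.
    return (dst - src) % n

def hops_with_ramp(src, dst, n, ramp_in, ramp_out):
    # Same ring, plus one shortcut edge from ramp_in to ramp_out;
    # a message takes whichever path is shorter.
    via = ring_hops(src, ramp_in, n) + 1 + ring_hops(ramp_out, dst, n)
    return min(ring_hops(src, dst, n), via)
```

For example, on an 8-stop ring with a ramp from stop 1 to stop 5, a message from 0 to 6 takes 3 hops instead of 6, while traffic not near the ramp is unaffected.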


>> I think I suggested this before, and the idea got shot down, but I
>> cannot find the post. It is mystery operations where the opcode comes
>> from a register value. I was thinking of adding an instruction
>> modifier to do this. The instruction modifier would supply the opcode
>> bits for the next instruction from a register value. This would only
>> be applied to specific classes of instructions. In particular
>> register-register operate instructions. Many of the register-register
>> functions are not decoded until execute time. The function code is
>> simply copied to the execution unit. It does not have to run through
>> the decode and rename stage. I think this field could easily come from
>> a register. Seems like it would be easy to update the opcode while the
>> instruction is sitting in the reorder buffer.
>
> Classic 360 EXECUTE instruction ??
> Basically, it sounds dangerous. {Side channels in plenty}

Yeah.

Better to keep side-channels to a minimum.


In my case, only certain registers could have side channels:
DLR/R0, DHR/R1, and SP/R15;
Various CRs (LR, SPC, SSP, etc.).

Though, many of these have ended up being read-only side-channels (the
usage of side channels to update registers has mostly been eliminated,
in favor of using normal register updates whenever possible).


The SP side-channel was mostly a consequence of:
Early on, my ISA had PUSH/POP, which operated via a side-channel (long
since eliminated);
Previously, the interrupt mechanism worked by swapping the values of SP
and SSP, rather than the current mechanism of swapping them in the decoder.

Note that the decoder also renumbers the registers in RV64 Mode.


All the normal GPRs are entirely inaccessible via side-channels.


In my current design, all register ports are resolved in the ID2 / RF stage.

Originally, predication was handled in EX1, but has been effectively
partly relocated to ID2 as well (with updates to SR.T being handled via
interlock stalls, if the following instruction depends on SR.T).


This did have the consequence of effectively increasing CMPxx to 2
cycles, but it did improve FPGA timing (though this leaves the combined
compare-with-zero-and-branch as often preferable). Luckily, extending
the displacement on these ops from 8s to 11s (or 13s in XG2 mode) did
make them more useful (the 13s case can branch +/- 8K).

Though, the split compare and branch cases still do have the advantage
of being able to reach a further distance (1MB in Baseline, 8MB in XG2).
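The reach figures follow from the signed displacement width scaled by the 16-bit instruction granularity; note the 20s/23s widths used below for the plain branch are inferred from the stated 1MB/8MB reach rather than quoted from the post:

```python
def branch_reach(disp_bits, scale=2):
    # One-sided reach of a signed displacement, scaled to 2-byte
    # (16-bit) instruction granularity.
    return (1 << (disp_bits - 1)) * scale
```

A 13s displacement gives the +/- 8K quoted above; 20s and 23s give the 1MB and 8MB figures respectively.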

Technically, there are still the two-register compare-and-branch ops,
but these are not enabled by default in any profile and are still
limited to Disp8s. The main reason they are around is that RISC-V mode
needs this feature to be enabled.

...
