Misc (Maybe controversial): Scaled Index Load/Store?


Brendan Bohannon

Apr 11, 2022, 6:21:28 PM
to RISC-V ISA Dev
So, as best I can tell, there don't seem to be any other proposals around for scaled-index load/store encodings, so I went and came up with something as a possibility.


These would be located within a seemingly unused corner of the 'A' extension's encoding space.

The instructions would allow for Load/Store using two registers to calculate the address (with an index that is shifted left by 0..3 bits).
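As a rough sketch of the semantics being proposed (the function names and the 64-bit memory model are illustrative, not part of any proposal):

```python
# Scaled-index load semantics, roughly:
#   rd = mem[rs1 + (rs2 << shamt)]   with shamt in 0..3

def scaled_index_addr(rs1: int, rs2: int, shamt: int, xlen: int = 64) -> int:
    """Effective address for a scaled-index access, truncated to XLEN."""
    assert 0 <= shamt <= 3
    mask = (1 << xlen) - 1
    return (rs1 + ((rs2 << shamt) & mask)) & mask

def lw_scaled(mem: dict, rs1: int, rs2: int, shamt: int) -> int:
    """Load a 32-bit little-endian word from the scaled-index address.
    'mem' is a byte-addressed dict standing in for memory."""
    base = scaled_index_addr(rs1, rs2, shamt)
    return int.from_bytes(bytes(mem.get(base + i, 0) for i in range(4)), "little")

# Example: base 0x1000, index 3, scale 8 (shamt=3) -> address 0x1018
addr = scaled_index_addr(0x1000, 3, 3)
```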

For context, I have a core that (mostly) runs my own ISA, but can also run RV64 as a sub-mode (currently RV64IM; the plan is currently for RV64IMC_ZfinxZdinx).

The native ISA has this addressing mode, and it occurs frequently enough in practice that it makes sense to have it available in RV mode as well. Though, I would prefer the encodings not to stomp on other extensions which may or may not exist in this space.

Currently, support for 'A', 'F', or 'D' is unlikely. 'A' doesn't really map over at all, F/D don't really match up with my FPU design (Zdinx is at least a bit closer).

For 'M', some of the operations will be "absurdly slow" (e.g. a 67-cycle DIV/REM), but it should technically work (previously, the native ISA was doing 64-bit multiply, divide, and similar in software, via internal runtime calls). However, MULW will be a 3-cycle operation.

The native ISA is a 3-wide VLIW, but it may eventually be possible to handle RV as an in-order superscalar (at the moment it will only run RV instructions one at a time). I had considered a scheme to allow load-time conversion into a VLIW form (in software, via bit-twiddling the instructions), but this mechanism would be incompatible with the 'C' extension (so it is low priority).

Probably enough for now...

Philipp Tomsich

Apr 11, 2022, 6:40:02 PM
to Brendan Bohannon, RISC-V ISA Dev
Brendan,

On Tue, 12 Apr 2022 at 00:21, Brendan Bohannon <cr8...@gmail.com> wrote:
So, as best I can tell, there don't seem to be any other proposals around for scaled-index load/store encodings, so I went and came up with something as a possibility.


These would be located within a seemingly unused corner of the 'A' extension's encoding space.

The instructions would allow for Load/Store using two registers to calculate the address (with an index that is shifted left by 0..3 bits).

Zba already provides a full set of shift-by-{0,1,2,3}-and-add instructions that can be used for the address-calculation: add(.uw)?, sh[123]add(.uw)?.
These also cover the use-cases for unsigned 32bit indices on RV64 (i.e., ".uw" variants that zero-extend the index before shifting).
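The shift-and-add semantics can be sketched as follows for RV64 (a minimal model of the Zba behavior described above, not an authoritative spec):

```python
# shNadd rd, rs1, rs2  shifts rs1 left by N (1..3) and adds rs2.
# The ".uw" variants first zero-extend the low 32 bits of rs1.

MASK64 = (1 << 64) - 1

def shadd(rs1: int, rs2: int, n: int) -> int:
    """shNadd: rd = rs2 + (rs1 << n), truncated to 64 bits."""
    return (rs2 + (rs1 << n)) & MASK64

def shadd_uw(rs1: int, rs2: int, n: int) -> int:
    """shNadd.uw: zero-extend rs1[31:0] before the shift-and-add."""
    return (rs2 + ((rs1 & 0xFFFFFFFF) << n)) & MASK64

# An unsigned 32-bit index with bit 31 set stays positive under .uw:
a = shadd_uw(0x80000000, 0x1000, 3)   # 0x1000 + 0x80000000*8
```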

Note that these Zba instructions can be combined with an existing load/store (and could be fused into a single op using instruction fusion) for the same effect as your proposal, without taking up additional codepoints from our opcode space.

For context, I have a core that (mostly) runs my own ISA, but can also run RV64 as a sub-mode (currently RV64IM; the plan is currently for RV64IMC_ZfinxZdinx).

The native ISA has this addressing mode, and it occurs frequently enough in practice that it makes sense to have it available in RV mode as well. Though, I would prefer the encodings not to stomp on other extensions which may or may not exist in this space.

As indicated above, instruction-fusion for Zba instructions with load/store instructions will give you the desired result w/o consuming additional instruction space.

--Philipp.
 
Currently, support for 'A', 'F', or 'D' is unlikely. 'A' doesn't really map over at all, F/D don't really match up with my FPU design (Zdinx is at least a bit closer).

For 'M', some of the operations will be "absurdly slow" (e.g. a 67-cycle DIV/REM), but it should technically work (previously, the native ISA was doing 64-bit multiply, divide, and similar in software, via internal runtime calls). However, MULW will be a 3-cycle operation.

The native ISA is a 3-wide VLIW, but it may eventually be possible to handle RV as an in-order superscalar (at the moment it will only run RV instructions one at a time). I had considered a scheme to allow load-time conversion into a VLIW form (in software, via bit-twiddling the instructions), but this mechanism would be incompatible with the 'C' extension (so it is low priority).

Probably enough for now...

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/1561dca7-d5dc-4025-b12d-33c642bd96cen%40groups.riscv.org.

Jim Wilson

Apr 11, 2022, 8:09:04 PM
to Brendan Bohannon, RISC-V ISA Dev
On Mon, Apr 11, 2022 at 3:21 PM Brendan Bohannon <cr8...@gmail.com> wrote:
So, as best I can tell, there don't seem to be any other proposals around for scaled-index load/store encodings, so I went and came up with something as a possibility.

Alibaba/T-Head cores have this, and it is already present in parts that use their cores, like the Allwinner D1.  For instance, they have a lrb instruction that does
     rd  = sign_extend(mem[(rs1+rs2<<imm2)+7: (rs1+rs2<<imm2)])
The 7: looks a little confusing to me, but I think it means all bits from 0 to 7, i.e. 8 bits.  This instruction is encoded in the custom-0 space as it is an Alibaba extension.
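Under that reading, the lrb semantics could be sketched as follows (a minimal model based on the pseudo-code above; the byte-dict memory and function names are illustrative only):

```python
# T-Head lrb, as described above:
#   rd = sign_extend(mem[rs1 + (rs2 << imm2)])   (one byte, sign-extended)

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low 'bits' of value as a two's-complement integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def lrb(mem: dict, rs1: int, rs2: int, imm2: int) -> int:
    """Load one byte from rs1 + (rs2 << imm2) and sign-extend it."""
    assert 0 <= imm2 <= 3
    byte = mem.get((rs1 + (rs2 << imm2)) & ((1 << 64) - 1), 0)
    return sign_extend(byte, 8)

mem = {0x2004: 0xFF}
r = lrb(mem, 0x2000, 1, 2)   # address 0x2000 + 1*4 = 0x2004 -> -1
```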

It is hard to get good docs for Alibaba cores outside China though, and even harder if you can't read Mandarin, or aren't handy with translation tools.

These would be located within a seemingly unused corner of the 'A' extension's encoding space.

I'd prefer that we add the missing char and short atomic support in the A extension encoding space.  The absence of these instructions is a major hassle for the GCC port.

Jim

Brendan Bohannon

Apr 11, 2022, 8:42:20 PM
to RISC-V ISA Dev, Philipp Tomsich, RISC-V ISA Dev, Brendan Bohannon
On Monday, April 11, 2022 at 5:40:02 PM UTC-5 Philipp Tomsich wrote:
Brendan,

On Tue, 12 Apr 2022 at 00:21, Brendan Bohannon <cr8...@gmail.com> wrote:
So, as best I can tell, there don't seem to be any other proposals around for scaled-index load/store encodings, so I went and came up with something as a possibility.


These would be located within a seemingly unused corner of the 'A' extension's encoding space.

The instructions would allow for Load/Store using two registers to calculate the address (with an index that is shifted left by 0..3 bits).

Zba already provides a full set of shift-by-{0,1,2,3}-and-add instructions that can be used for the address-calculation: add(.uw)?, sh[123]add(.uw)?.
These also cover the use-cases for unsigned 32bit indices on RV64 (i.e., ".uw" variants that zero-extend the index before shifting).

Note that these Zba instructions can be combined with an existing load/store (and could be fused into a single op using instruction fusion) for the same effect as your proposal, without taking up additional codepoints from our opcode space.


The CPU core I have does not do instruction fusion, nor is it likely to do so in the immediate future. The mechanisms for instruction fusion have costs similar to those needed to support superscalar, namely pattern matching and the ability to advance the program counter by an increment which can't easily be determined during instruction fetch, both of which I would prefer to avoid.

Admittedly, this is part of why my ISA originally went for VLIW rather than superscalar (e.g. leaving a lot of this up to the compiler rather than having the CPU try to figure it out). At present, how far to advance PC for a bundle can be determined from a relatively small number of instruction bits.


Though, granted, if the encodings or similar exist, something like:
    SHL3ADD x6, x10, x11
    LD x12, x6, 0
Could be transformed into:
   NOP   ||   LD x12, (x10, x11*8)
During program load-time (the NOP is needed to absorb the space, would be executed in parallel with the load).

Though this has the big drawback that the conversion mechanism is incompatible with the 'C' extension; it would require compiling the binaries as RV64IM or similar, mostly as it has been implemented via bit-twiddling the low-order bits and setting the CPU into a special sub-mode when running code which has been modified in this way. The program loader would need to detect the presence of compressed instructions and disable this conversion if any are detected.

Note that this mode-switching can be made per-function by twiddling function-pointer bits in the GOT or similar (a similar mechanism is used to encode inter-ISA jumps in this case).

Translating from one ISA to another during load isn't really likely to be a workable option in these cases.


For context, I have a core that (mostly) runs my own ISA, but can also run RV64 as a sub-mode (currently RV64IM; the plan is currently for RV64IMC_ZfinxZdinx).

The native ISA has this addressing mode, and it occurs frequently enough in practice that it makes sense to have it available in RV mode as well. Though, I would prefer the encodings not to stomp on other extensions which may or may not exist in this space.

As indicated above, instruction-fusion for Zba instructions with load/store instructions will give you the desired result w/o consuming additional instruction space.


Granted, it is maybe a controversial idea, but ideally one would be able to provide reasonable performance in this area without needing a fairly expensive mechanism, while being able to leave most of the code in RISC-V mode. At present, with RISC-V code limited to running one instruction per cycle, and often needing 2x or 3x as many instructions in cases like this, performance is not ideal (even with GCC being arguably a lot more clever than my own C compiler).


But, as noted, RISC-V is a much more popular ISA, so there is some potential utility in being able to run code compiled for it (even if not in the 'G' profile).

My existing core mostly runs on Spartan-7 and Artix-7 FPGAs, typically at a clock speed of 50MHz (getting up to 75MHz requires using small caches, which adversely affects performance).

Smallest FPGA I was able to fit on was an XC7S25, although this was 1-wide and without an FPU or MMU (I am using a software-managed TLB).
Most of my testing has been on an XC7A100T though.


As with probably most hobby ISA projects, it is mostly running Doom, Quake, etc.

I also have a port of GLQuake (software rendered OpenGL), and a partial Quake 3 Arena port, though Quake 3 Arena uses considerably more RAM and is basically unusable on a 50 MHz CPU core. Performance from most Quake family games is still a bit lacking in general though (sadly, not really all that playable).

It can also be used for video playback, though limited to relatively low complexity video codecs (still mostly excludes MPEG family codecs).

Brendan Bohannon

Apr 11, 2022, 9:19:18 PM
to RISC-V ISA Dev, jim.wil...@gmail.com, RISC-V ISA Dev, Brendan Bohannon
On Monday, April 11, 2022 at 7:09:04 PM UTC-5 jim.wil...@gmail.com wrote:
On Mon, Apr 11, 2022 at 3:21 PM Brendan Bohannon <cr8...@gmail.com> wrote:
So, as best I can tell, there don't seem to be any other proposals around for scaled-index load/store encodings, so I went and came up with something as a possibility.

Alibaba/T-Head cores have this, and it is already present in parts that use their cores, like the Allwinner D1.  For instance, they have a lrb instruction that does
     rd  = sign_extend(mem[(rs1+rs2<<imm2)+7: (rs1+rs2<<imm2)])
The 7: looks a little confusing to me, but I think it means all bits from 0 to 7, i.e. 8 bits.  This instruction is encoded in the custom-0 space as it is an Alibaba extension.

It is hard to get good docs for Alibaba cores outside China though, and even harder if you can't read Mandarin, or aren't handy with translation tools.


I am mostly dealing with this in the context of my own core on an FPGA.

Mostly just sort of need to stake out a claim on the relevant encoding space or similar (such as "if it comes up later", ideally want to avoid a conflict).

This isn't a big enough extension to justify using the custom spaces.


In my case (from my own builds), scaled-index load/store seems to be around 15-20% of the total for loads/stores (with around 80-85% being constant-displacement forms), this seems like potentially enough for it to matter if one needs to encode each as 2 or 3 instructions (in a performance sensitive use case).

I had previously investigated also (Rb+Ri*Sc+Disp), but even with compiler support, these would come up rarely enough to not make much difference, so can probably ignore this case. Similarly, can also skip things like auto-increment.

 
These would be located within a seemingly unused corner of the 'A' extension's encoding space.

I'd prefer that we add the missing char and short atomic support in the A extension encoding space.  The absence of these instructions is a major hassle for the GCC port.


 Probably also true.

My design should be orthogonal to this, since adding more types would mostly depend on using funct3 to select other types within the same set of operations (this part of the 'atomic' encoding seems to follow the same pattern as other Load/Store operations, just with only the W and D cases having been defined).

My design wouldn't interfere here, as it was built using a seemingly unused pattern in the funct7 field.

There are still a few more bit patterns left for more possible atomic operations or similar.


Robert Finch

Apr 11, 2022, 11:04:06 PM
to RISC-V ISA Dev, cr8...@gmail.com, jim.wil...@gmail.com, RISC-V ISA Dev

On the topic of fusion:

Could there be a generic NOP instruction that provides two additional register read ports and one additional write port for fusing with a following instruction? I have added a custom instruction called ‘REG’ to my design for this purpose.

I also use fused constant extensions in my RISCV design.


I added scaled indexed addressing as custom instructions under the opcode 13 space to my own RISCV core. I have since decided to drop the scaling because my low-cost compiler could not use it a significant portion of the time. It ended up using shift and add instructions a lot of the time to scale the index. Given the scaling was used only a fraction of a percentage of the time, I figure it is not worth providing directly. To keep the limit of two register reads in an instruction, I decided to use a prefix instruction to provide extra registers for indexed store operations among other things. This amounts to a form of instruction fusing. I still think it would be a good idea to have indexed addressing represented as such in assembly language. Indexed load operations make sense to me, but stores require too many register reads to implement directly.

If you are going to have indexed addressing in the design maybe we could keep the opcodes consistent?

I am using: |Rs3|Sc|~~|op3|Rs1|6|Rd|13| for indexed loads

and |Rs3|Sc|Rs2|Rs1|7|~~|op3|13| for indexed stores.

Sc, the scale, is not used, and stores will use fused instructions.

BGB

Apr 12, 2022, 2:47:13 AM
to Robert Finch, RISC-V ISA Dev, jim.wil...@gmail.com
On 4/11/2022 10:04 PM, Robert Finch wrote:
>
> On the topic of fusion:
>
> Could there be a generic NOP instruction that provides two additional
> register read ports and one additional write port for fusing with a
> following instruction? I have added a custom instruction called ‘REG’
> to my design for this purpose.
>
> I also use fused constant extensions in my RISCV design.
>
>

I guess that is an interesting approach.


In my core, as can be noted, it is using a 6R/3W register file,
supporting up to 3 instructions per clock cycle.
However, Lane 3 is infrequently used in VLIW bundles, and this is where
the extra register ports come from (my ISA disallowing Store or similar
within a 3-wide bundle).

        ALU | ST   // OK
  ALU | ALU | ST   // Invalid
  ALU | ALU | ALU  // OK


A few other instructions, such as DMAC (Rd=Rd+Rs*Rt), have a similar
restriction to Store in this regard.
Things like Load/Store operations, Branches, instructions like DMAC or
FMAC, ..., are only allowed in Lane 1.

Where the bundle packing is, effectively:
                  Lane1
          Lane2 | Lane1
  Lane3 | Lane2 | Lane1

Natively, each instruction lane has 2 read ports and 1 write port, with
Lanes 1 and 2 having access to a 3rd read port which is aliased to the
Lane3 read ports.

How the ports map depends primarily on the bundle length and similar
(with some special cases).


Some operations in my ISA, like Load/Store Pair, and 128-bit SIMD
operations, will effectively eat all 3 lanes at the same time, so are
not allowed to be used in bundles at all.

Effectively, 128-bit SIMD instructions treat the register file as 3R+1W
with logical 128-bit registers (32x 128b, rather than 64x 64-bit), while
typically combining the functional units for the first two lanes in
parallel (likewise for 128-bit operations, only even-numbered registers
are allowed for the 128-bit registers) for a single logical operation.

Read ports are named Rs,Rt,Ru,Rv,Rx,Ry

So, Bundles:
  OP1 Rs, Rt, Rx -> Rn1
  OP2 Ru, Rv, Ry -> Rn2 | OP1 Rs, Rt, Rx -> Rn1
  OP3 Rx, Ry, Z -> Rn3 | OP2 Ru, Rv, Ry -> Rn2 | OP1 Rs, Rt, Rx -> Rn1

SIMD (128-bit):
  OP1 Ru:Rs, Rv:Rt, Ry:Rx -> Rn2:Rn1


Base encoding: 16 or 32 bit
Bundles: Between 1x and 3x 32-bit instruction words;
  Only 32-bit encodings may be bundled.
  The 16-bit encodings are mostly for size-optimized code.
    Say, one can use 16/32 for "cold path", and VLIW for "hot path".


My CPU core is strictly in-order, and all instructions within a bundle
execute in lock-step with each other.

The compiler is responsible for things like bundles being well formed:
  Correct dispatch width for the machine (this is explicit in the
architecture);
    However, it is basically frozen at 3 for the mainline profile.
  Doesn't try to execute a combination of instructions which is not
allowed;
    Trying to run an instruction in the incorrect lane is
non-recoverable;
    Invalid combinations may lead to undefined results or a fault.
  ...

The 3-wide case also was best on the cost/benefit tradeoff: it wasn't
much more expensive than 2-wide, but can do more work per clock cycle
(more in terms of increasing the number of valid 2-wide configurations
than in terms of Lane 3 being used all that often in its own right).


PC advance is determined during Instruction Fetch.
  No pattern matching is used (beyond basic instruction and bundle
length determination);
  There is no sort of "error recovery" for getting this wrong.

There is also no traditional instruction fusion, nor splitting, nor
microcode. The most one can do is generate a fault, but interrupt
handling is particularly slow.

Pattern matching instructions during IF (Fetch) would likely not be
ideal for timing, and during ID (Decode) it is too late to act on it (by
ID "this train is already in motion", and IF is now dealing with the
next bundle).


Theoretically, one can modulo-schedule loops, but my compiler isn't
really smart enough to pull this off effectively, so typically the
programmer needs to do this part manually (it works a whole lot better
in ASM).



As can be noted, I was able to map a RV64I decoder over to this
pipeline without too much issue (it mostly affects IF and ID; stages
beyond this don't really care).

As-is, it just sort of executes like a naive 1-wide machine, or IOW,
"not particularly fast".



> I added scaled indexed addressing as custom instructions under the
> opcode 13 space to my own RISCV core. I have since decided to drop the
> scaling because my low-cost compiler could not use it a significant
> portion of the time. It ended up using shift and add instructions a
> lot of the time to scale the index. Given the scaling was used only a
> fraction of a percentage of the time, I figure it is not worth
> providing directly. To keep the limit of two register reads in an
> instruction, I decided to use a prefix instruction to provide extra
> registers for indexed store operations among other things. This
> amounts to a form of instruction fusing. I still think it would be a
> good idea to have indexed addressing represented as such in assembly
> language. Indexed load operations make sense to me, but stores require
> too many register reads to implement directly.
>

Curious. As noted, I was typically seeing scaled-index operations closer
to around 15-20%.

Though, in my ISA, the scale is typically hard-wired to either 1 or the
element size (depending on the base register: PC/GBR/TBR use 1,
everything else typically uses the element size). One could save 2 bits
by leaving off the scale and hard-wiring it to the element size.

A lot of the code I am working with uses arrays in loops, and arrays
generate a lot of these.



As for implementation cost:
  I would probably assume not having indexed load/store on a small
1-wide machine.


So, it would make more sense for "fast but simple" cores, as opposed to
"minimalist cores".

By the time we are up to bigger superscalar style cores, it may or may
not matter, but probably doesn't hurt too much either (a superscalar
machine can probably scavenge up a spare read port from somewhere).


A prefix could work, I guess the main thing is being able to recognize
it while checking a fairly small number of bits.


> If you are going to have indexed addressing in the design maybe we
> could keep the opcodes consistent?
>
> I am using: |Rs3|Sc|~~|op3|Rs1|6|Rd|13| for indexed loads
>
> and |Rs3|Sc|Rs2|Rs1|7|~~|op3|13| for indexed stores.
>
> Sc the scale is not used. And stores will be using fused instructions.
>

I am not entirely sure I understand your notation...

I am guessing you put it off with the ALU ops or similar?...

Philipp Tomsich

Apr 12, 2022, 4:44:53 AM
to Jim Wilson, Brendan Bohannon, RISC-V ISA Dev
Jim,

On Tue, 12 Apr 2022 at 02:09, Jim Wilson <jim.wil...@gmail.com> wrote:
On Mon, Apr 11, 2022 at 3:21 PM Brendan Bohannon <cr8...@gmail.com> wrote:
So, as best I can tell, there don't seem to be any other proposals around for scaled-index load/store encodings, so I went and came up with something as a possibility.

Alibaba/T-Head cores have this, and it is already present in parts that use their cores, like the Allwinner D1.  For instance, they have a lrb instruction that does
     rd  = sign_extend(mem[(rs1+rs2<<imm2)+7: (rs1+rs2<<imm2)])
The 7: looks a little confusing to me, but I think it means all bits from 0 to 7, i.e. 8 bits.  This instruction is encoded in the custom-0 space as it is an Alibaba extension.

The version of the English-language documentation I have says
    rd <- sign_extend(mem[(rs1 + rs2 << imm2)])
although the fact that the sign-extension happens on the byte-granule is only visible from the instruction's description as "a load byte instruction [...] and extends signed bits".

I'll have another chance to bring up documentation quality in a call with the T-Head folks later this week…

Hope this clarifies the situation,
Philipp.

Robert Finch

Apr 12, 2022, 7:41:28 AM
to RISC-V ISA Dev, Philipp Tomsich, cr8...@gmail.com, RISC-V ISA Dev, jim.wil...@gmail.com

>Curious. As noted, I was typically seeing scaled-index operations closer
>to around 15-20%.

Okay, I was seeing about 1 to 3% of total instructions using indexed addressing (or about 4 to 12% of memory ops). Of course, my dataset was limited to boot functions and a couple of small demos. Not really enough statistically to be very valid, but enough to indicate that the crowd numbers IIRC (numbers read from architecture texts) were probably correct. This was more than the 1% sort of minimum I had for an instruction feature to qualify as a must include. Of the 1% only about ¼ effectively used index scaling. The issue is that the compiler would not evaluate the index expression in an order suitable for the optimizer to be able to use it. I could maybe improve the compiler in this regard, but it is not a simple task. It would have to evaluate expressions multiple times in multiple orders then choose the best ordering.

I was looking at using generic fuse instructions (prefixes basically) because then the core does not need to pattern match. As soon as it sees a generic it knows to fuse it into the next instruction. I need this functionality because there are a couple of instructions that require four register read ports or two write ports, FDP – fused dot product for instance.

Allen Baum

Apr 12, 2022, 3:51:25 PM
to Robert Finch, RISC-V ISA Dev, Philipp Tomsich, cr8...@gmail.com, jim.wil...@gmail.com
Aside from whether a [scaled] indexed load/store is a good idea, I am pleased and encouraged by the amount of experimentation I'm seeing here.

The usefulness of ops like these needs good compiler support, and it's hard to see how useful these are (or not) without that.
A large problem is the asymmetry of load vs. store: load has 2R+1W, but store has 3R+0W, and that 3rd read port makes it very expensive.


Jeff Scott

Apr 12, 2022, 4:01:52 PM
to Allen Baum, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, cr8...@gmail.com, jim.wil...@gmail.com

I don’t disagree Allen, but PowerPC and Arm both included Reg+Reg addressing.  Add that to the wide variety of implementations in this thread, as well as RISCY’s, and I think there is justification for a standard extension which supports it.

 

Jeff

 


Philipp Tomsich

Apr 12, 2022, 4:05:20 PM
to Allen Baum, Robert Finch, RISC-V ISA Dev, cr8...@gmail.com, jim.wil...@gmail.com
Note that opportunities for (scaled) index load will be seen in the hot blocks of the following SPEC 2017 benchmarks:
  • deepsjeng
  • xz (although some will be of the form "(base + index << shamt) + signed-immediate")
  • leela (most of these will have a signed-immediate attached)
  • mcf
We'd need to quantify in more detail to provide useful data to drive future decisions, but this is not on our radar at the moment.
In a first approximation, we'll probably look at easy fusion cases vs. non-fusible cases.

Philipp.

Philipp Tomsich

Apr 12, 2022, 4:10:28 PM
to Jeff Scott, Allen Baum, Robert Finch, RISC-V ISA Dev, cr8...@gmail.com, jim.wil...@gmail.com
Jeff,

Do you have data on the expected benefit?
In our experience of analyzing generated code sequences, easy fusion cases (i.e. "add rd,rs1,rs2; l[bhwd] rd,0(rd)") can be used together with register-steering to allow for compressed instructions for both the add and the load.  This will (if the compressed forms can be used) give you a sequence that is equivalent to the reg+reg addressing and will fit into 32 bits.
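The easy-fusion pair described here can be recognized with a simple adjacent-instruction check; the following is an illustrative sketch (not any real core's decoder, and the tuple encoding of instructions is an assumption for the example):

```python
# Recognize "add rd, rs1, rs2" immediately followed by "l[bhwd] rd, 0(rd)",
# i.e. a load that addresses through the add result and overwrites it,
# leaving no live temporary -- the fusible shape described above.

def fusible(insn1: tuple, insn2: tuple) -> bool:
    """insn1 = ('add', rd, rs1, rs2); insn2 = (op, rd, offset, base)."""
    op1, rd1, _, _ = insn1
    op2, rd2, off, base = insn2
    return (op1 == "add"
            and op2 in ("lb", "lh", "lw", "ld")
            and off == 0
            and base == rd1      # load addresses through the add result
            and rd2 == rd1)      # and overwrites it (no live temp)

pair = (("add", 6, 10, 11), ("ld", 6, 0, 6))
```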

Cheers,
Philipp.

Jeff Scott

Apr 12, 2022, 4:32:45 PM
to Philipp Tomsich, Allen Baum, Robert Finch, RISC-V ISA Dev, cr8...@gmail.com, jim.wil...@gmail.com

Hi Philipp,

 

No data to share unfortunately.  I have never done fusion before, but the concept intrigues me.  Any idea how many gates it takes to implement fusion like this compared to having a dedicated instruction?  I assume the cost increases with issue width?  Frequency impact?

 

Jeff

BGB

Apr 12, 2022, 4:46:50 PM
to Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
On 4/12/2022 6:41 AM, Robert Finch wrote:
>
> >Curious. As noted, I was typically seeing scaled-index operations closer
> >to around 15-20%.
>
> Okay, I was seeing about 1 to 3% of total instructions using indexed
> addressing (or about 4 to 12% of memory ops). Of course, my dataset
> was limited to boot functions and a couple of small demos. Not really
> enough statistically to be very valid, but enough to indicate that the
> crowd numbers IIRC (numbers read from architecture texts) were
> probably correct. This was more than the 1% sort of minimum I had for
> an instruction feature to qualify as a must include. Of the 1% only
> about ¼ effectively used index scaling. The issue is that the compiler
> would not evaluate the index expression in an order suitable for the
> optimizer to be able to use it. I could maybe improve the compiler in
> this regard, but it is not a simple task. It would have to evaluate
> expressions multiple times in multiple orders then choose the best
> ordering.
>

I meant 15-20% of memory instructions, not total instructions. If it is
a count of total instructions, it is a bit less (around 5% or so).


As per clock-cycle cost (emulator stats), in general, memory ops are
around 25% of the total (for base cycles), so Load/Store Index would be
around 3%-5% of total clock cycles.


Memory operations (in general) would be closer to around 60% or so if
one counts the penalties from L1 and L2 misses. But, this is not
dependent on address mode.


I will agree that flexible scale is rarely used (as can be noted, my
ISA was using a hard-wired scale for the 32-bit encodings). In my case,
one needs to use a 64-bit instruction encoding to get a scale that is
not hard-wired to the element size, but this form basically gives an
{Rb+Ri*Sc+Disp} mode, which is rarely used even when enabled, so one
could probably leave it out.



> I was looking at using generic fuse instructions (prefixes basically)
> because then the core does not need to pattern match. As soon as it
> sees a generic it knows to fuse it into the next instruction. I need
> this functionality because there are a couple of instructions that
> require four register read ports or two write ports, FDP – fused dot
> product for instance.
>
>

I have something similar in my ISA, I called them "jumbo prefixes":
  One serves the role of basically, "glue 24 bits onto the immediate of
the following instruction"
    So, Disp9u becomes Disp33s, Imm9u becomes Imm33s, ...
  The other glues on more opcode bits, and typically either:
    A smaller immediate extension or a 4th register field (usable with
3R+1W instructions).
    Extends Imm9 fields to Imm17s.
    Modifies (Rb, Ri) to (Rb, Ri*Sc, Disp9u)

Sticking two of the former onto certain instructions is used to expand
the immediate to 64 bits.
  However, this is now treated as a multi-lane operation by the CPU core.
  Internally, each lane only has 33 bits for an immediate value, so
uses 2 lanes to deal with 64 bits.
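The immediate-gluing described above could be sketched roughly like this (the bit widths follow the post, but the exact composition rule and sign handling are my assumptions):

```python
# "Glue 24 bits onto the immediate of the following instruction":
# a 24-bit prefix payload concatenated above a 9-bit instruction
# immediate yields a 33-bit value, here interpreted as signed
# (Disp9u -> Disp33s, per the description above).

def glue_imm(prefix24: int, imm9: int) -> int:
    """Compose a signed 33-bit immediate from a jumbo prefix and Imm9."""
    raw = ((prefix24 & 0xFFFFFF) << 9) | (imm9 & 0x1FF)    # 33 raw bits
    return raw - (1 << 33) if raw & (1 << 32) else raw     # sign-extend

neg_one = glue_imm(0xFFFFFF, 0x1FF)   # all-ones pattern -> -1
```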


Encoding for these was, basically, there were a few large constant load
instructions, which I made "Lane 1 only", and if one tries to encode
them in Lane 2 or 3, they are interpreted as a Jumbo prefix.

In my ISA notation, this would be for encoding things like:
  MOV 0x123456789ABCDEF, R37
  ADD 0x123456789ABCDEF, R28
( Which are roughly 96-bit encodings for 1-cycle operations )


An analog in RISC-V would be, say:
  If the low 2 bits of the instruction encoded bundling (as in a
modified/bundled form);
  AUIPC, if put into Lane 2 or 3, became such a prefix (able to glue 25
bits onto the following instruction).

Though, there would be no real way to use jumbo encodings without the
compiler being aware of them (and able to encode them), which would not
likely be the case in this situation (if we assume a compiler that has
no knowledge of the bundling process).


This would depend on another processor state bit (WXE) wrt the low 2
bits (when not 11):
  WXE=0: Interpret them as Compressed instructions.
  WXE=1: Interpret them as Bundled instructions.


Within my ISA, the WXE bit is mostly used to encode the interpretation
of bundled encodings:
  WXE=0: We are running Scalar / 1-wide code (RISC style).
  WXE=1: We are running VLIW / 3-wide code (also known as WEX in my ISA).

In my case, this doesn't really change the instruction format, but is
needed partly to allow for binary compatibility in other areas, given
dispatch-width is explicit in the architectural state.


With another bit (WX2) encoding an ISA selector:
  WX2=0: Run in BJX2 mode (native ISA)
  WX2=1: Run in RISC-V mode

In my ISA (and when running in RISC-V mode on my core) these state bits
are effectively hidden in things like function pointers and
link-register values. This is a potential point of concern with the
implementation, but best I can gather it shouldn't matter for normal
programs (which don't tend to twiddle with function-pointer bits or the
link register or similar).


Though, one could argue that a core built specifically for RISC-V could
likely be both cheaper (and maybe faster) than one built for an entirely
different ISA and then running RISC-V via a decoder hack.

On something like an XC7S25, something like RV32I probably makes more
sense, as one can implement it without eating pretty much the entire LUT
budget of the FPGA.



I had a smaller 32-bit ISA (basically a 1-wide 32-bit predecessor to my
BJX2 ISA design), which had a LUT cost of around 5 kLUT.

It was roughly comparable in resource cost to another modified ISA
design of mine, which was basically an SH2-like subset of the SH4 ISA
(SuperH), which generally had fixed-length 16-bit instructions.

In my experimenting, it was generally hard to get a usable 32-bit CPU
core in much less than around 5 kLUT.


My current core (with all the stuff I have going on / enabled), is
roughly an order of magnitude larger than this...

But, in this case, I was mostly trying to build something "kinda fast"
that still fits on an FPGA, rather than something that really makes
sense as a microcontroller.

Philipp Tomsich

unread,
Apr 12, 2022, 4:51:50 PM4/12/22
to Jeff Scott, Allen Baum, Robert Finch, RISC-V ISA Dev, cr8...@gmail.com, jim.wil...@gmail.com
I'll have to defer to others on the implementation aspects and area cost, as I always make sure to stay on the software side of things.

Best regards,
philipp.

Jeff Scott

unread,
Apr 12, 2022, 4:53:16 PM4/12/22
to Philipp Tomsich, Allen Baum, Robert Finch, RISC-V ISA Dev, cr8...@gmail.com, jim.wil...@gmail.com

So nice to not have to worry about area and frequency!  😊

Edwin Sutanto

unread,
Apr 12, 2022, 5:10:44 PM4/12/22
to Jeff Scott, Philipp Tomsich, Allen Baum, Robert Finch, RISC-V ISA Dev, cr8...@gmail.com, jim.wil...@gmail.com
paper and presentation regarding fusion if you are interested

I don't fwd the pdf directly since I don't know if that's allowed in this forum

thanks
Edwin

Jeff Scott

unread,
Apr 12, 2022, 5:24:02 PM4/12/22
to Edwin Sutanto, Philipp Tomsich, Allen Baum, Robert Finch, RISC-V ISA Dev, cr8...@gmail.com, jim.wil...@gmail.com

Thanks Edwin!

 

Taking a look…

 

Jeff

BGB

unread,
Apr 12, 2022, 8:53:53 PM4/12/22
to Jeff Scott, Philipp Tomsich, Allen Baum, Robert Finch, RISC-V ISA Dev, jim.wil...@gmail.com
On 4/12/2022 3:53 PM, Jeff Scott wrote:
>
> So nice to not have to worry about area and frequency! 😊
>


Yeah, opcode fusion is one of those things that looks good in a
theoretical / on-paper sense, but less good to someone who frequently
battles with resource cost and timing constraints...

Things like "if(A==B)" and "C=A+B;" are not really your friends when it
comes to cost or timing (at least on an FPGA).


(Rb, Ri), (Rb, Ri*Sc): Commonly useful enough to be desirable.
   Main drawback is that store needs 3 read ports
     This is likely (mostly) an issue for 1-wide implementations, such
as microcontrollers.
     2 or 3 wide machines could reuse ports from another lane or similar.
   Can use the same adder for both index and displacement.

(Rb, Ri*Sc, Disp): Infrequently useful, less ideal for timing.
   Kinda need to get clever with carry propagation here to keep cost low.
     Multiple adders in a row is not ideal for timing.
   Sorta works OK if the displacement is known to be small.
      Say, if one special-cases the adder design for a 12-bit displacement.

Autoincrement:
  No; these have implementation costs and don't actually save any
clock cycles.
  An explicit increment can nearly always be "absorbed" by another
instruction or bundle.
  (These, along with branch delay slots, are things I actively decided
against).

Encoding explicit sign/zero extension for index values is probably not
necessary, cheaper to assume that the index is already sign or zero
extended.


Not really many other modes that really seem to need mention.


I guess, Load/Store Pair can also be useful, but mostly only makes sense
if one assumes that the core is 2 or 3 wide internally.
In my case, these instructions exist on a 3-wide profile, but not on the
1-wide profile (code that is meant for a 1-wide core is not allowed to
use them).


I guess I can note that in my case, the index is also internally limited
to 33 bits (treated as sign extended), rather than the full 48 bits of
the address space. Partly this is for cost reasons, and also because it
is unlikely that one is going to need arrays or similar much bigger than
4 billion elements all that often.

This is N/A for things like "ptr1-ptr2" or "ptr+long", which need to
support the full address space.

The compiler can take a hint though from the type of index used:
  int / unsigned int: Use the scaled-index forms.
  long / unsigned long: Use ALU operations for address calculation.
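As a sketch, the effective-address calculation with the 33-bit index limit might look something like the following (the exact truncation and wrap behavior here is my assumption for illustration, not a spec):

```python
MASK48 = (1 << 48) - 1    # 48-bit address space
MASK33 = (1 << 33) - 1    # index internally limited to 33 bits

def ea_scaled(rb, ri, sc):
    # Scaled-index effective address: rb + (ri << sc), with the index
    # first truncated and sign-extended to 33 bits, and the result
    # wrapped to the 48-bit address space.
    idx = ri & MASK33
    if idx & (1 << 32):            # sign bit of the 33-bit index
        idx -= 1 << 33
    return (rb + (idx << sc)) & MASK48

print(hex(ea_scaled(0x1000, 4, 3)))   # -> 0x1020 (4 elements, 8 bytes each)
```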


...



> Jeff
>
> *From:* Philipp Tomsich <philipp...@vrull.eu>
> *Sent:* Tuesday, April 12, 2022 3:52 PM
> *To:* Jeff Scott <jeff....@nxp.com>
> *Cc:* Allen Baum <allen...@esperantotech.com>; Robert Finch
> *Subject:* Re: [EXT] Re: [isa-dev] Misc (Maybe controversial): Scaled
> Index Load/Store?
>
>
> I'll have to defer to others on the implementation aspects and area
> cost, as I always make sure to stay on the software side of things.
>
> Best regards,
> philipp.
>
> On Tue, 12 Apr 2022 at 22:32, Jeff Scott <jeff....@nxp.com> wrote:
>
> Hi Philipp,
>
> No data to share unfortunately.  I have never done fusion before,
> but the concept intrigues me.  Any idea how many gates it takes to
> implement fusion like this compared to having a dedicated
> instruction?  I assume the cost increases with issue width? 
> Frequency impact?
>
> Jeff
>
> *From:* Philipp Tomsich <philipp...@vrull.eu>
> *Sent:* Tuesday, April 12, 2022 3:10 PM
> *To:* Jeff Scott <jeff....@nxp.com>
> *Cc:* Allen Baum <allen...@esperantotech.com>; Robert Finch
> *Subject:* Re: [EXT] Re: [isa-dev] Misc (Maybe controversial):
> Scaled Index Load/Store?
>
>
> Jeff,
>
> Do you have data on the expected benefit?
>
> In our experience of analyzing generated code sequences, easy
> fusion cases (i.e. "add rd,rs1,rs2; l[bhwd] rd,0(rd)") can be used
> together with register-steering to allow for compressed
> instructions both for the add and load.  This will (if
> the compressed forms can be used) give you a sequence that is
> equivalent to the reg+reg addressing and will fit into 32bits.
>
> Cheers,
>
> Philipp.
>
> On Tue, 12 Apr 2022 at 22:01, Jeff Scott <jeff....@nxp.com> wrote:
>
> I don’t disagree Allen, but PowerPC and Arm both included
> Reg+Reg addressing.  Add that to the wide variety of
> implementations in this thread, as well as RISCY’s, and I
> think there is justification for a standard extension which
> supports it.
>
> Jeff
>
> *From:* Allen Baum <allen...@esperantotech.com>
> *Sent:* Tuesday, April 12, 2022 2:51 PM
> *To:* Robert Finch <robf...@gmail.com>
> *Cc:* RISC-V ISA Dev <isa...@groups.riscv.org>; Philipp
> *Subject:* [EXT] Re: [isa-dev] Misc (Maybe controversial):
> Scaled Index Load/Store?
>
>
> --
> You received this message because you are subscribed to the
> Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from
> it, send an email to isa-dev+u...@groups.riscv.org.
> To view this discussion on the web visit
> https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAF4tt%3DAfQXSWu6jNyGruo26ODVWbikb%2BuDPp6C9jjkKdeAQyRQ%40mail.gmail.com
>

Allen Baum

unread,
Apr 12, 2022, 10:09:51 PM4/12/22
to BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
Prefix immediate ops aren't new (I know of 2 microprocessors that had that: Transputer, and GE RPM, 30 years ago) 
History repeats.  I actually have a (long expired) patent on that which used a CSR so you could macrofuse or interrupt in the middle and continue.
(The Mill does a lot of immediate prefixing + slot register borrowing in its architecture as well.)

But, what Philipp said; and I should have mentioned extensive benchmarking to see if it pays off. 
Fusing 2 *compressed* ops is much more attractive, should be fairly cheap, and is a lot more flexible.
Chris Celio had numbers for that either in a publication or thesis (or both...)


Allen Baum

unread,
Apr 12, 2022, 10:17:23 PM4/12/22
to BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
And: macrofusion doesn't have to handle every possible pair of ops, 
but just c.Add Rd, Idx, Base; c.Ldx Rd, 0(Rd) (an indexed load), or various combinations that do autoincrement, which are easy low-hanging fruit.

Allen Baum

unread,
Apr 13, 2022, 2:12:37 AM4/13/22
to Paul Campbell, Jeff Scott, Philipp Tomsich, RISC-V ISA Dev, Robert Finch, jim.wil...@gmail.com, BGB
I don't think someone understands how low cost fusion is supposed to work.

One constraint is that it is still a max of 2 source, 1 destination.
So indexed load can work: ADD dest<-base+index; Load dest<-0(dest) is an indexed load with only 2 sources, and a single destination.
Note that you can't do that with an indexed store, which really needs 3 sources and no destination.
You can't do an autoincrement load, but an autoincremented store is ADDI Base<-Base+Increment; STx Source->0(Base) or STx Source, 0(Base); Addi Base<-Base+Increment
(the latter is a bit funky because you need to bypass the adder to get the memory address, or have a separate adder for address arithmetic)
There are lots of other useful idioms (see Chris Celio's paper) that can be performed within those constraints.
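A toy version of that legality check, in Python (the instruction representation is invented purely for the sketch):

```python
def fusible(op1, op2):
    # Can two decoded ops fuse under a max-2-source / max-1-dest
    # constraint?  Each op is (sources, dests), tuples of reg names.
    srcs = set(op1[0])
    # op2 sources produced by op1 come off the bypass, not a read port
    srcs |= set(op2[0]) - set(op1[1])
    dests = set(op1[1]) | set(op2[1])
    return len(srcs) <= 2 and len(dests) <= 1

# indexed load: ADD rd<-base+idx; LD rd<-0(rd)   -> fuses
print(fusible((('base', 'idx'), ('rd',)), (('rd',), ('rd',))))   # True
# indexed store: ADD t<-base+idx; ST src,0(t)    -> 3 sources, no fuse
print(fusible((('base', 'idx'), ('t',)), (('src', 't'), ())))    # False
```

The autoincremented-store pair (STx then ADDI, or vice versa) also passes this check, matching the cases above.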

On Tue, Apr 12, 2022 at 9:36 PM Paul Campbell <pa...@taniwha.com> wrote:
On Wednesday, 13 April 2022 12:53:48 PM NZST BGB wrote:
> Yeah, opcode fusion is one of those things that looks good in a
> theoretical / on-paper sense, but less good to someone who frequently
> battles with resource cost and timing constraints...
>
> Things like "if(A==B)" and "C=A+B;" are not really your friends when it
> comes to cost or timing (at least on an FPGA).

or "now your 6 load/store units need an extra  register read port each" ....

        Paul


BGB

unread,
Apr 13, 2022, 3:52:18 AM4/13/22
to Paul Campbell, Jeff Scott, Philipp Tomsich, isa...@groups.riscv.org, Allen Baum, Robert Finch, jim.wil...@gmail.com
On 4/12/2022 11:36 PM, Paul Campbell wrote:
> On Wednesday, 13 April 2022 12:53:48 PM NZST BGB wrote:
>> Yeah, opcode fusion is one of those things that looks good in a
>> theoretical / on-paper sense, but less good to someone who frequently
>> battles with resource cost and timing constraints...
>>
>> Things like "if(A==B)" and "C=A+B;" are not really your friends when it
>> comes to cost or timing (at least on an FPGA).
> or "now your 6 load/store units need an extra register read port each" ....

I don't think that is the main issue.


It is also not the main issue with opcode fusion, which is more about
the cost of pattern-matching various special-case combinations and then
fetching and decoding them as if they were a single (larger)
instruction. Since this logic will be specific to particular
instructions in particular combinations, it is likely to be a more
expensive approach compared with more traditional approaches, eg:
  Adding a new opcode;
Or:
  Adding a dedicated prefix which expands or modifies the instruction
it combines with.


As for scaled-index:
For wide-issue designs, it is possible to implement it in a way which
does not require increasing the total number of register ports, by
aliasing the ports between lanes (with certain lanes becoming unusable
if their ports are in-use by another lane).


This can allow having both 3 lanes on a 6R register file, as well as
also having instructions using 3 inputs.
  Just you can't run a 3-input instruction while also running something
in Lane 3.
  Basically, running a 3-input instruction in Lane 1 or 2 means Lane 3
is off-limits during that cycle.
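As a sketch, the bundle-legality rule is something like the following (2 read ports per lane, 6 total; the representation is invented for illustration):

```python
def bundle_ok(src_counts):
    # src_counts: per-lane number of register sources (None = idle).
    # Each lane natively has 2 read ports (6 total); a 3-input op in
    # Lane 1 or 2 borrows Lane 3's ports, so Lane 3 must sit idle.
    active = [n for n in src_counts if n is not None]
    if sum(active) > 6:
        return False
    borrowing = any(n > 2 for n in src_counts[:2] if n is not None)
    lane3_busy = len(src_counts) > 2 and src_counts[2] is not None
    return not (borrowing and lane3_busy)

print(bundle_ok([3, 2, None]))  # True: 3-input op, Lane 3 idle
print(bundle_ok([3, 2, 2]))     # False: Lane 3 ports already borrowed
```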


In terms of the abstract model, each lane is natively 2R+1W, with 64-bit
data values, and a 33-bit immediate, a major+minor opcode, ...


Decided to leave out a more detailed description of my cores' functional
units and pipeline structure, though relevant to note here is that many
units are shared between lanes.

So, you can do:
 ALU | ALU | ALU  //OK
 ALU | ALU | FPU  //OK
 ALU | ALU | MUL  //OK
       ALU | FPU  //OK
       ALU | MEM  //OK
       FPU | MEM  //OK: FPU may issue from Lane 2 (if Lane 1 is not FPU)
       ...

But, not:
  FPU | FPU  //Bad: Only 1 FPU exists (can only Co-Issue for 128b SIMD).
  MEM | MEM  //Bad: Mem only has 1 port, Lane 1 Only
  BRA | ALU  //Bad: Branch is Lane 1 only
  MEM | ALU  //Bad: Mem is Lane 1 only
  MUL | ALU  //Bad: Multiplier is Lane 1 only

So, for example, having 3 lanes does not mean one has 3 FPUs, 3
Load/Store ports, ...

It does mean I have 3 ALUs and 3 Shift Units, ..., but this is probably
slightly less impressive.


Most more expensive mechanisms are kept as exclusive to Lane 1.
This also avoids wasting lots of resources on functional units which
probably won't be used in practice.

Logic for most rarely used instructions is also exclusive to Lane 1.


But, like, if one wants to do 3 ALU operations at the same time, do
multiple shifts at the same time, one can "knock themselves out".
There are also enough random ALU instructions laying around that one can
gain a nice boost to ILP from parallelizing these.

I had at one point considered eliminating the Lane 3 shift-unit, as it
seemed like a bit of an obscure edge case. However, then I realized that
it was being used a bit more than I had thought (there was actually an
unexpectedly high number of random shift operations and similar being
offloaded to Lane 3).


However, going wider than 3 isn't so nice:
  Costs due to things like register ports and forwarding start to go up
rapidly;
  The amount of "free ILP" laying around mostly falls off the table.
    One would then need to open the "pandora's box" of having multiple
memory ports.
    At 2 or 3-wide, Lane 1 is already pretty much back-to-back memory ops.

...


Jeff Scott

unread,
Apr 13, 2022, 11:09:40 AM4/13/22
to Allen Baum, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com

Unfortunately, Chris never implemented fusion, at least at the time he presented at the 2016 RISC-V Summit (available on RISC-V International youtube channel).  Decode time in a processor is very important.  With fusion and the C-extension it now becomes a longer path impacting your frequency.  I don’t think fusion is cheap from this point of view.

 

Jeff

 

From: Allen Baum <allen...@esperantotech.com>
Sent: Tuesday, April 12, 2022 9:10 PM
To: BGB <cr8...@gmail.com>
Cc: Robert Finch <robf...@gmail.com>; RISC-V ISA Dev <isa...@groups.riscv.org>; Philipp Tomsich <philipp...@vrull.eu>; jim.wil...@gmail.com <jim.wil...@gmail.com>
Subject: [EXT] Re: [isa-dev] Misc (Maybe controversial): Scaled Index Load/Store?

 


Richard Herveille

unread,
Apr 13, 2022, 11:21:51 AM4/13/22
to Jeff Scott, Allen Baum, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com, Richard Herveille
  • With fusion and the C-extension it now becomes a longer path impacting your frequency

C-extension didn’t add much delay to the decode part.

But adding fusion did create a large delay (probably just my implementation).

Biggest issue is that we need to merge combinations of {rvc,rvc}, {rv,rvc}, {rvc,rv}, {rv,rv}.

That turned into a big delay. Even if only looking add immediates.

 

Richard

Jeff Scott

unread,
Apr 13, 2022, 11:30:38 AM4/13/22
to Richard Herveille, Allen Baum, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com

Good data Richard.  For context, is your machine single-issue?  The issue becomes bigger for superscalar implementations.

 

Jeff

Richard Herveille

unread,
Apr 13, 2022, 1:07:16 PM4/13/22
to Jeff Scott, Allen Baum, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com, Richard Herveille

It’s a single issue.

Robert Finch

unread,
Apr 13, 2022, 1:18:46 PM4/13/22
to RISC-V ISA Dev, Richard Herveille, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com, Jeff Scott, Allen Baum, cr8...@gmail.com

I think for machines with an elastic pipeline, like OoO superscalar machines, it should be possible to pipeline the pattern match / decode across multiple clock cycles. It would add to the latency of the instruction, but multiple decoders could be added to alleviate this.

Could there be a fusion prefix to indicate what type of fusion is taking place so that the instructions do not need to be decoded to find out?

Allen Baum

unread,
Apr 13, 2022, 2:21:58 PM4/13/22
to Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
My philosophy: if it hurts when you do that, don't do that. 
The low hanging fruit is {rvc,rvc}; 
The really easy low hanging fruit is aligned {rvc, rvc}.
This should fit nicely into a single issue machine.

Richard Herveille

unread,
Apr 13, 2022, 2:55:25 PM4/13/22
to Allen Baum, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com, Richard Herveille

Aligned or non-aligned doesn’t matter in my implementation.

I need to fix a few other things. Next I’ll go back to macro-fusion.

 

Richard

Richard Herveille

unread,
Apr 13, 2022, 2:56:20 PM4/13/22
to Allen Baum, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com, Richard Herveille
  • I need to fix a few other things. Next I’ll go back to macro-fusion

That came out weird.

First I need to fix a few other things, then I’ll go back to macro-fusion.

Andrew Waterman

unread,
Apr 13, 2022, 3:13:17 PM4/13/22
to Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
Also note that, although the microarchitectural details are not publicly available, there exist commercial implementations of RISC-V, ARMv8 and x86-64 that perform fusion, some at [very] high clock rates.  No need to debate whether it's feasible, since there are existence proofs.

Edwin Sutanto

unread,
Apr 13, 2022, 3:32:18 PM4/13/22
to Andrew Waterman, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
x86 is limited to cmp+br though
the br depends on eflags

Edwin

Sent from my iPhone

On Apr 13, 2022, at 12:13 PM, Andrew Waterman <and...@sifive.com> wrote:



Andrew Waterman

unread,
Apr 13, 2022, 3:52:53 PM4/13/22
to Edwin Sutanto, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
On Wed, Apr 13, 2022 at 12:32 PM Edwin Sutanto <esut...@gmail.com> wrote:
x86 is limited to cmp+br though
the br depends on eflags

True enough, AFAIK.  But high-performance ARMv8 processors are performing fusion based upon comparing register identifiers, including patterns analogous to some described in Celio's paper.  See https://github.com/gcc-mirror/gcc/blob/f2ebf2d98efe0ac2314b58cf474f44cb8ebd5244/gcc/config/aarch64/aarch64.cc#L25258

Some vendors of high-performance RISC-V processors are doing so, too; see e.g. https://www.mail-archive.com/gcc-p...@gcc.gnu.org/msg274240.html

John Leidel

unread,
Apr 13, 2022, 3:54:47 PM4/13/22
to Andrew Waterman, Edwin Sutanto, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
Andrew, do you have a reference or title to Celio's paper? Might be
good to the general populous to take a look.

best
john

Andrew Waterman

unread,
Apr 13, 2022, 3:56:40 PM4/13/22
to John Leidel, Edwin Sutanto, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
On Wed, Apr 13, 2022 at 12:54 PM John Leidel <john....@gmail.com> wrote:
Andrew, do you have a reference or title to Celio's paper?  Might be
good to the general populous to take a look.


(But, as others on this thread have mentioned, it doesn't address the implementation challenges being discussed here.)

Greg Favor

unread,
Apr 13, 2022, 3:59:47 PM4/13/22
to Andrew Waterman, Edwin Sutanto, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
On Wed, Apr 13, 2022 at 12:52 PM Andrew Waterman <and...@sifive.com> wrote:
True enough, AFAIK.  But high-performance ARMv8 processors are performing fusion based upon comparing register identifiers, including patterns analogous to some described in Celio's paper.  See https://github.com/gcc-mirror/gcc/blob/f2ebf2d98efe0ac2314b58cf474f44cb8ebd5244/gcc/config/aarch64/aarch64.cc#L25258

Some vendors of high-performance RISC-V processors are doing so, too; see e.g. https://www.mail-archive.com/gcc-p...@gcc.gnu.org/msg274240.html

Yes, that design achieves high frequency (and a typical length decode pipeline length).  "Decompressing" compressed instructions and recognizing instruction-pair fusion opportunities all happens at the beginning of the decode pipe (i.e. not over several pipe stages).

Greg

Edwin Sutanto

unread,
Apr 13, 2022, 4:00:53 PM4/13/22
to Andrew Waterman, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
RISC-V being “simple enough”, wouldn’t that make decoder cost and latency not as prohibitive as in v8 and x86?


Edwin

Sent from my iPhone

On Apr 13, 2022, at 12:52 PM, Andrew Waterman <and...@sifive.com> wrote:



Jeff Scott

unread,
Apr 13, 2022, 4:52:48 PM4/13/22
to Edwin Sutanto, Andrew Waterman, Allen Baum, Richard Herveille, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com

I think fusion is less painful for higher performance cores where there is already a deep pipeline.  I think in the lower end, it will affect frequency, without additional pipeline stages.  This may be feasible, but is certainly not desirable over simply adding a new opcode.

 

Jeff

Sean Halle

unread,
Apr 13, 2022, 5:06:32 PM4/13/22
to Edwin Sutanto, Andrew Waterman, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
With ISA changes being pushed through quickly.. RISC-V may find itself with all the baggage of v8.  Consider that ARM started out even simpler than RISC-V.  Now look at it.

You can't subtract instructions from the ISA once they're embedded into binaries.  Doesn't matter how good they looked at the moment; once you sober up.. you're hitched.  No going backwards.

Personally, I appreciate this discussion..  not just the desire to put in new ISA features, because there's a natural human tendency to "improve"..  but I also greatly appreciate the push in the opposite direction..  any change to the "standard set" of extensions to which distros like Fedora and Ubuntu are compiled (and so binaries downloaded require) involves winners and losers..  it is difficult to find new instructions that provide a win across a strong majority (70%? 90%?) of producers of RISC-V chips that are used to run Linux..  so I appreciate the push to avoid adding instructions to that standard set..  there needs to be a powerful case with extensive benchmarking across a wide variety of micro-architectures in order to justify imposing those new instructions on all vendors, because all of their chips _must_ run the downloaded binaries in order to be commercially viable.  (If customers of your chip do apt-get to install a package.. and it crashes because your chip doesn't implement an extension to which one dependency in the package was compiled..  you go out of business, literally).

In other words, the idea that "you're always free to not implement an extension" is not true in the real world of companies that produce silicon.  The software eco system provides binary releases, via apt-get, yum, etc.  That eco-system is such a steep barrier that any silicon that does not support the instructions embedded into the binaries is not viable.

IE, RISC-V has matured.  Real world things now place constraints that weren't there when it was still a research project.  The idea of extensions is fun to think about, but the commercial reality is that.. such extensions are niche curiosities in terms of the number of users.  Implementing any extensions outside of IMAFDC means the provider of the silicon has to compile all OS distributions their customers want, and force customers to recompile everything themselves (libraries, dependencies, etc) if their customers want the benefit of the additional extensions..  not many customers are open to this.  For niche products, and embedded products that build all their own binaries top to bottom, it can work.. for large scale shipment.. not happening outside of fantasy stories.

Which means that we need to be very very careful when considering any instructions that may make their way into an extension that is used widely in the eco-system (such as embedded into binaries fetched by yum or apt-get..).  Those have a high commercial impact (and we need a very high bar in terms of voting in order to accept such a change).  

For extensions that are intended for niche markets..  that's a different story :-)

(Note, the above is orthogonal to fusion, of course, because fusion is hidden beneath the ISA; just sharing concerns about ISA creep from someone producing silicon that runs Linux).



Krste Asanovic

unread,
Apr 13, 2022, 5:30:02 PM4/13/22
to Sean Halle, Edwin Sutanto, Andrew Waterman, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com

> On Apr 13, 2022, at 2:05 PM, Sean Halle <sean...@gmail.com> wrote:
>
> Consider that ARM started out even simpler than RISC-V.

Please point at an early ARM ISA spec that is simpler than the base RISC-V ISA.

Certainly not the original 1985-era 32-bit design, and definitely not the first 64-bit architecture released in 2011.

I’m not trying to present a value judgement here on whether simple or complex is good, just wanting to correct this mistaken belief (oft-repeated) that ARM started simple.

Krste



Philipp Tomsich

unread,
Apr 13, 2022, 5:34:10 PM4/13/22
to Sean Halle, Edwin Sutanto, Andrew Waterman, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, jim.wil...@gmail.com
On Wed, 13 Apr 2022 at 23:06, Sean Halle <sean...@gmail.com> wrote:
IE, RISC-V has matured.  Real world things now place constraints that weren't there when it was still a research project.  The idea of extensions is fun to think about, but the commercial reality is that.. such extensions are niche curiosities in terms of the number of users.  Implementing any extensions outside of IMAFDC means the provider of the silicon has to compile all OS distributions their customers want, and force customers to recompile everything themselves (libraries, dependencies, etc) if their customers want the benefit of the additional extensions..  not many customers are open to this.  For niche products, and embedded products that build all their own binaries top to bottom, it can work.. for large scale shipment.. not happening outside of fantasy stories.

Which means that we need to be very very careful when considering any instructions that may make their way into an extension that is used widely in the eco-system (such as embedded into binaries fetched by yum or apt-get..).  Those have a high commercial impact (and we need a very high bar in terms of voting in order to accept such a change).  

Fortunately, there are methods (e.g. ifunc, multiarch library-paths, etc.) implemented in Linux to either have a single library with functions optimized for multiple architecture levels or to have the runtime-linker resolve to use shared objects that better fit a given architecture level.  Key optimisations (e.g. in crypto libraries) can therefore be rolled out in backward-compatible ways on a rolling-release basis.
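A rough Python analogue of what an ifunc resolver does (the capability names below are made up for the sketch; the real mechanism is resolved once by the runtime linker):

```python
def make_popcount(hwcaps):
    # Pick an implementation once, ifunc-style, based on the
    # advertised extension list (names purely illustrative).
    if "zbb" in hwcaps:
        def popcount(x):        # stands in for a cpop-based version
            return bin(x).count("1")
    else:
        def popcount(x):        # generic fallback
            n = 0
            while x:
                x &= x - 1      # clear the lowest set bit
                n += 1
            return n
    return popcount

print(make_popcount(["zbb"])(0b1011))   # -> 3
print(make_popcount([])(255))           # -> 8
```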

Philipp.

Sean Halle

unread,
Apr 13, 2022, 5:35:10 PM4/13/22
to Krste Asanovic, Edwin Sutanto, Andrew Waterman, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com

Fair enough.. the starting point may have had a bit of a head start.. but in the interim.. comparing that starting point to v8..  ?

Is v8 the desired end point?  

(That's actually a genuine question.. is there a desire out there to end up with an instruction set of similar size and complexity?)


Krste Asanovic

unread,
Apr 13, 2022, 6:58:54 PM4/13/22
to Sean Halle, Edwin Sutanto, Andrew Waterman, Allen Baum, Richard Herveille, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com

On Apr 13, 2022, at 2:34 PM, Sean Halle <sean...@gmail.com> wrote:


Fair enough.. the starting point may have had a bit of a head start.. but in the interim.. comparing that starting point to v8..  ?

Is v8 the desired end point?  

(That's actually a genuine question.. is there a desire out there to end up with an instruction set of similar size and complexity?)


I believe computing needs will keep evolving, so I don’t think there is a desired end point.
The set of standard RISC-V instruction set extensions will keep growing as makes sense. 
To be clear, copying every ISA feature of every other ISA does not make sense.

Krste

BGB

unread,
Apr 13, 2022, 7:19:07 PM4/13/22
to Jeff Scott, Edwin Sutanto, Andrew Waterman, Allen Baum, Richard Herveille, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
On 4/13/2022 3:52 PM, Jeff Scott wrote:
>
> I think fusion is less painful for higher performance cores where
> there is already a deep pipeline.  I think in the lower end, it will
> affect frequency, without additional pipeline stages.  This may be
> feasible, but is certainly not desirable over simply adding a new opcode.
>

I agree here...


This is why I would prefer an extension which adds these as (optional)
instructions.

Given "most other ISAs" have them in some form, and these come up rather
frequently, I figure they are probably justifiable.
Leaving them optional allows skipping them for smaller
implementations, or implementations where they would not be advantageous.



In my case, my core has an 8-stage pipeline, roughly:
  PF IF ID1 ID2 EX1 EX2 EX3 WB

The ( PF/IF ) stages deal with Fetch and PC stepping, and are moderately
tight in terms of timing.
  Both fusion and (traditional) superscalar would affect these stages.
  But, in the latter case, this may end up being necessary.

Memory Load/Store is handled during EX1/EX2/EX3.


Its performance is probably reasonable for a hobbyist class CPU core.

At 50MHz, memcpy speeds: L1=160MB/s, L2=65MB/s, RAM=24MB/s (limited to 1
memory access at a time; raw unidirectional DRAM bandwidth is roughly
90MB/s in this case);
Roughly 70k in Dhrystone last I checked (kinda sucks apparently, ~ 0.8
DMIPS/MHz), though this was compiling it with my own C compiler, which
is admittedly "kinda weak".

My compiler is still sort of in the category where it tends to spill
pretty much everything to the stack at the end of each basic block; also
Dhrystone seems to like to hit integer divide pretty hard, where I was
using a shift-subtract loop implemented in software (I have now added a
hardware divider, but have not yet added support for the instruction to
my compiler, or as of yet compared the relative performance of the
hardware divider and the software loop).

Though, for small leaf functions which match certain criteria, it will
static-assign everything to registers and thus not need to spill variables
to the stack. My compiler generally lacks most "advanced" optimizations
though (function inlining, loop unrolling, TBAA, ...).

...



> Jeff
>
> *From:* Edwin Sutanto <esut...@gmail.com>
> *Sent:* Wednesday, April 13, 2022 3:01 PM
> *To:* Andrew Waterman <and...@sifive.com>
> *Cc:* Allen Baum <allen...@esperantotech.com>; Richard Herveille
> *Subject:* Re: [EXT] Re: [isa-dev] Misc (Maybe controversial): Scaled
> Index Load/Store?
>
> *Caution: *EXT Email
>
> RISC-V being “simple enough”, wouldn’t that make decoder cost and
> latency not as prohibitive as in v8 and x86?
>
> Edwin
>
> Sent from my iPhone
>
>
>
> On Apr 13, 2022, at 12:52 PM, Andrew Waterman <and...@sifive.com>
> wrote:
>
> 
>
> On Wed, Apr 13, 2022 at 12:32 PM Edwin Sutanto
> <esut...@gmail.com> wrote:
>
> x86 is limited to cmp+br though
>
> the br depends on eflags
>
> True enough, AFAIK.  But high-performance ARMv8 processors are
> performing fusion based upon comparing register identifiers,
> including patterns analogous to some described in Celio's paper. 
> See
> https://github.com/gcc-mirror/gcc/blob/f2ebf2d98efe0ac2314b58cf474f44cb8ebd5244/gcc/config/aarch64/aarch64.cc#L25258
>
> Some vendors of high-performance RISC-V processors are doing so,
> too; see e.g.
> https://www.mail-archive.com/gcc-p...@gcc.gnu.org/msg274240.html
>
> Edwin
>
> Sent from my iPhone
>
>
>
> On Apr 13, 2022, at 12:13 PM, Andrew Waterman
> <and...@sifive.com> wrote:
>
> 
>
> Also note that, although the microarchitectural details
> are not publicly available, there exist commercial
> implementations of RISC-V, ARMv8 and x86-64 that perform
> fusion, some at [very] high clock rates.  No need to
> debate whether it's feasible, since there are existence
> proofs.
>
> On Wed, Apr 13, 2022 at 11:22 AM Allen Baum
> <allen...@esperantotech.com> wrote:
>
> My philosophy: if it hurts when you do that, don't do
> that.
>
> The low hanging fruit is {rvc,rvc};
>
> The really easy low hanging fruit is aligned {rvc, rvc}.
>
> This should fit nicely into a single issue machine.
>
> On Wed, Apr 13, 2022 at 8:21 AM Richard Herveille
> <richard....@roalogic.com> wrote:
>
> * With fusion and the C-extension it now becomes
> a longer path impacting your frequency
>
> C-extension didn’t add much delay to the decode part.
>
> But adding fusion did create a large delay
> (probably just my implementation).
>
> Biggest issue is that we need to merge
> combinations of {rvc,rvc}, {rv,rvc}, {rvc,rv},
> {rv,rv}.
>
> That turned into a big delay. Even if only looking at
> add immediates.
>
> Richard
>
> *From: *Jeff Scott <jeff....@nxp.com>
> *Date: *Wednesday, 13 April 2022 at 17:09
> *To: *Allen Baum <allen...@esperantotech.com>,
> BGB <cr8...@gmail.com>
> *Cc: *Robert Finch <robf...@gmail.com>, RISC-V
> *Subject: *RE: [EXT] Re: [isa-dev] Misc (Maybe
> controversial): Scaled Index Load/Store?
>
> Unfortunately, Chris never implemented fusion, at
> least at the time he presented at the 2016 RISC-V
> Summit (available on RISC-V International youtube
> channel).  Decode time in a processor is very
> important.  With fusion and the C-extension it now
> becomes a longer path impacting your frequency.  I
> don’t think fusion is cheap from this point of view.
>
> Jeff
>
> *From:* Allen Baum <allen...@esperantotech.com>
> *Sent:* Tuesday, April 12, 2022 9:10 PM
> *To:* BGB <cr8...@gmail.com>
> *Cc:* Robert Finch <robf...@gmail.com>; RISC-V
> *Subject:* [EXT] Re: [isa-dev] Misc (Maybe
> controversial): Scaled Index Load/Store?
>
> *Caution: *EXT Email
>
> Prefix immediate ops aren't new (I know of 2
> microprocessors that had that: Transputer, and GE
> RPM, 30 years ago)
>
> History repeats.  I actually have a (long expired)
> patent on that which used a CSR so you could
> macrofuse or interrupt in the middle and continue.
>
> The Mill does a lot of immediate prefixing + slot
> register borrowing in its architecture as well.
>
> But, what Philipp said, and I should have mentioned
> extensive benchmarking to see if it pays off.
>
> Fusing 2 *compressed* ops is much more attractive,
> should be fairly cheap, and is a lot more flexible.
>
> Chris Celio had numbers for that either in a
> publication or thesis (or both...)
>
> https://riscv.org/wp-content/uploads/2016/07/Tue1130celio-fusion-finalV2.pdf
>
> https://arxiv.org/abs/1607.02318
>
> --
> You received this message because you are
> subscribed to the Google Groups "RISC-V ISA Dev"
> group.
> To unsubscribe from this group and stop receiving
> emails from it, send an email to
> isa-dev+u...@groups.riscv.org.
> To view this discussion on the web visit
> https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAF4tt%3DDnzRapW04xkjxv0gPptFPBweQev20i47rmsry4whJt8Q%40mail.gmail.com.
>
>
>
>

BGB

unread,
Apr 13, 2022, 9:36:34 PM4/13/22
to Krste Asanovic, Sean Halle, Edwin Sutanto, Andrew Waterman, Allen Baum, Richard Herveille, Jeff Scott, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com
On 4/13/2022 5:58 PM, Krste Asanovic wrote:
>
>
>> On Apr 13, 2022, at 2:34 PM, Sean Halle <sean...@gmail.com> wrote:
>>
>>
>> Fair enough.. the starting point may have had a bit of a head start..
>> but in the interim.. comparing that starting point to v8..  ?
>>
>> Is v8 the desired end point?
>>
>> (That's actually a genuine question.. is there a desire out there to
>> end up with an instruction set of similar size and complexity?)
>>
>
> I believe computing needs will keep evolving, so I don’t think there
> is a desired end point.
> The set of standard RISC-V instruction set extensions will keep
> growing as makes sense.
> To be clear, copying every ISA feature of every other ISA does not
> make sense.
>

Agreed.

Minimizing the ISA listing isn't as much of a gain as it may seem, and
having a longer listing doesn't necessarily hurt as much as it may seem.
What matters more is things like which sorts of behaviors the CPU is
expected to perform, how these affect cost and other factors, and
what sorts of features are or are not required for an implementation, ...

But, I can also note that ARMv8 has a lot of features which I would not
be in favor of.


And, there are lots of CPU features that are better relegated to the past:
  Register/Memory (x86) and Memory/Memory (M68K/PDP/...) addressing modes;
  Branch delay slots;
  ...

I am more in favor of things like:
  Strict Load/Store;
    No using memory as an input or destination of an arithmetic operator.
  Keeping everything in GPRs;
    Ideally, no separate FPU or SIMD/Vector registers.


Sometimes complexity is justifiable.
  Sometimes it is unavoidable.



In my own ISA design, there are things I don't like, but there is not
always a good way to "fix" them, and trying to do something differently
may introduce worse problems than it solves.

Say, if I try to design something clean that has all the features I want
up-front, it may end up being "technically worse" than the more ad-hoc
design I had been using already (built from several years of layering on
hacks...). Even if the older/hackish design doesn't necessarily win as
many points on the "design elegance" front.

Richard Herveille

unread,
Apr 14, 2022, 2:57:57 AM4/14/22
to Greg Favor, Andrew Waterman, Edwin Sutanto, Allen Baum, Jeff Scott, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com, Richard Herveille
  • "Decompressing" compressed instructions and recognizing instruction-pair fusion opportunities all happens at the beginning of the decode pipe

Yup, that’s where I handle them, right at the IF stage (first pipeline stage).

Still, the many combinations are a problem (again, most likely an issue with my implementation/architecture).

It was stated before that {rvc,rvc} is the low-hanging fruit, but looking at the code that gets compiled, there are very few of those. The most obvious ones are {rv,rv} (addi, auipc), where there is no rvc encoding for auipc and the rvc addi doesn’t have enough range. (We run mostly embedded/FreeRTOS on our CPU.)

Other obvious ones are the lui/add(i), add(i)/load, add(i)/store combinations, but here many versions are generated forcing {rvc,rvc},{rv,rvc},{rv,rv} to be supported.

And finally there are no lb/sb/lh/sh versions, which are very popular in embedded apps.

 

Richard

 

 

From: Greg Favor <gfa...@ventanamicro.com>


Date: Wednesday, 13 April 2022 at 21:59
To: Andrew Waterman <and...@sifive.com>

Jeff Scott

unread,
Apr 14, 2022, 10:16:06 AM4/14/22
to Richard Herveille, Greg Favor, Andrew Waterman, Edwin Sutanto, Allen Baum, BGB, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com

Richard, Zce gives the lb/sb/lh/sh versions.  I know there were many spec changes going on for Zce.  Not sure if the dust has settled yet.

 

Jeff

 

From: Richard Herveille <richard....@roalogic.com>
Sent: Thursday, April 14, 2022 1:58 AM
To: Greg Favor <gfa...@ventanamicro.com>; Andrew Waterman <and...@sifive.com>
Cc: Edwin Sutanto <esut...@gmail.com>; Allen Baum <allen...@esperantotech.com>; Jeff Scott <jeff....@nxp.com>; BGB <cr8...@gmail.com>; Robert Finch <robf...@gmail.com>; RISC-V ISA Dev <isa...@groups.riscv.org>; Philipp Tomsich <philipp...@vrull.eu>; jim.wil...@gmail.com <jim.wil...@gmail.com>; Richard Herveille <richard....@roalogic.com>
Subject: Re: [EXT] Re: [isa-dev] Misc (Maybe controversial): Scaled Index Load/Store?

 

Caution: EXT Email

Torbjørn Viem Ness

unread,
Apr 15, 2022, 8:31:59 AM4/15/22
to RISC-V ISA Dev, Jeff Scott, esut...@gmail.com, Allen Baum, cr8...@gmail.com, Robert Finch, RISC-V ISA Dev, Philipp Tomsich, jim.wil...@gmail.com, Richard Herveille, gfa...@ventanamicro.com, andrew
Zce is getting relatively close to stable now. I don't expect major changes to the instructions going into the current batch of stuff to be ratified.

At least c.lbu, c.lh, c.lhu, c.sb and c.sh are in the basic set (the others are in an extended set which reuses some opcodes and is incompatible with the D extension).
Latest spec can be found here if you're interested in further reading: https://github.com/riscv/riscv-code-size-reduction/blob/main/Zc-specification/Zc.adoc