So, best I can tell, there don't seem to be any other proposals around for scaled-index load/store encodings, so I went and came up with something as a possibility. These would be located within a seemingly unused corner of the 'A' extension's encoding space. The instructions would allow for loads/stores using two registers to calculate the address (with an index that is shifted left by 0..3 bits).
For context, I have a core that (mostly) runs my own ISA, but can also run RV64 as a sub-mode (currently RV64IM; the plan is currently for RV64IMC_ZfinxZdinx). The native ISA has this addressing mode, and it occurs frequently enough in practice that it makes sense to have it available in RV mode as well. Though, I would prefer the encodings not to stomp on other extensions which may or may not exist in this space.
Currently, support for 'A', 'F', or 'D' is unlikely. 'A' doesn't really map over at all, and F/D don't really match up with my FPU design (Zdinx is at least a bit closer). For 'M', some of the operations will be "absurdly slow" (e.g., a 67-cycle DIV/REM), but it should technically work (previously, the native ISA was doing 64-bit multiply, divide, and similar in software, via internal runtime calls). However, MULW will be a 3-cycle operation.

The native ISA is a 3-wide VLIW, but it may be possible eventually to handle RV as an in-order superscalar (at the moment it will only run RV instructions one at a time). I had considered a scheme to allow load-time conversion into a VLIW form (in software, via bit-twiddling the instructions), but this mechanism would be incompatible with the 'C' extension (so it is low priority).

Probably enough for now...
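For clarity, the addressing mode being proposed computes rs1 + (rs2 << shamt) with a shift of 0..3 bits. A minimal sketch of that effective-address calculation (the function and parameter names here are my own, purely illustrative, not part of any encoding proposal):

```python
MASK64 = (1 << 64) - 1  # RV64 addresses wrap at 64 bits

def scaled_index_ea(rs1: int, rs2: int, shamt: int) -> int:
    """Effective address for a scaled-index load/store:
    base register plus index register shifted left by 0..3 bits."""
    assert 0 <= shamt <= 3
    return (rs1 + (rs2 << shamt)) & MASK64

# e.g. element 5 of an array of 8-byte values at base 0x1000:
# scaled_index_ea(0x1000, 5, 3) -> 0x1028
```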
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/1561dca7-d5dc-4025-b12d-33c642bd96cen%40groups.riscv.org.
Brendan,

On Tue, 12 Apr 2022 at 00:21, Brendan Bohannon <cr8...@gmail.com> wrote:
> So, best I can tell there doesn't seem to be any other proposals around for scaled-index load/store encodings, so I went and came up with something as a possibility. These would be located within a seemingly unused corner of the 'A' extension's encoding space. The instructions would allow for Load/Store using two registers to calculate the address (with an index that is shifted left by 0..3 bits).

Zba already provides a full set of shift-by-{0,1,2,3}-and-add instructions that can be used for the address calculation: add(.uw)?, sh[123]add(.uw)?. These also cover the use cases for unsigned 32-bit indices on RV64 (i.e., the ".uw" variants, which zero-extend the index before shifting). Note that these Zba instructions can be combined with an existing load/store (and could be fused into a single op using instruction fusion) for the same effect as your proposal, without taking up additional codepoints from our opcode space.
> For context, I have a core that (mostly) runs my own ISA, but can also run RV64 as a sub-mode (currently RV64IM; the plan is currently for RV64IMC_ZfinxZdinx). The native ISA has this addressing mode, and it occurs frequently enough in practice that it makes sense to have it available in RV mode as well. Though, I would prefer the encodings not to stomp on other extensions which may or may not exist in this space.

As indicated above, instruction fusion of Zba instructions with load/store instructions will give you the desired result without consuming additional instruction space.
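The Zba semantics referred to above can be sketched as follows (my own executable model of sh[123]add and the ".uw" variants, for illustration only, not normative):

```python
MASK64 = (1 << 64) - 1

def sh_n_add(rs1: int, rs2: int, n: int) -> int:
    """Zba shNadd (n in 1..3): rd = (rs2 << n) + rs1, wrapping at 64 bits."""
    assert 1 <= n <= 3
    return ((rs2 << n) + rs1) & MASK64

def sh_n_add_uw(rs1: int, rs2: int, n: int) -> int:
    """Zba shNadd.uw: zero-extend the low 32 bits of rs2 (the index)
    before shifting, covering unsigned 32-bit indices on RV64."""
    assert 1 <= n <= 3
    return (((rs2 & 0xFFFFFFFF) << n) + rs1) & MASK64

# A scaled-index load then becomes a two-instruction sequence, e.g.:
#   sh2add t0, index, base
#   lw     rd, 0(t0)
# which a core may fuse into a single internal op.
```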
On Mon, Apr 11, 2022 at 3:21 PM Brendan Bohannon <cr8...@gmail.com> wrote:
> So, best I can tell there doesn't seem to be any other proposals around for scaled-index load/store encodings, so I went and came up with something as a possibility.

Alibaba/T-Head cores have this, and it is already present in parts that use their cores, like the Allwinner D1. For instance, they have an lrb instruction that does

rd = sign_extend(mem[(rs1+rs2<<imm2)+7 : (rs1+rs2<<imm2)])

The "7:" looks a little confusing to me, but I think it means all bits from 0 to 7, i.e., 8 bits. This instruction is encoded in the custom-0 space, as it is an Alibaba extension. It is hard to get good docs for Alibaba cores outside China, though, and even harder if you can't read Mandarin or aren't handy with translation tools.
> These would be located within a seemingly unused corner of the 'A' extension's encoding space.

I'd prefer that we add the missing char and short atomic support in the 'A' extension encoding space. The absence of these instructions is a major hassle for the GCC port.
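Going back to the lrb description above: reading the pseudo-notation as "the byte at rs1 + (rs2 << imm2), sign-extended", its behavior might be modeled like this (a sketch based purely on the description in this thread, not on T-Head documentation):

```python
def lrb(mem: bytes, rs1: int, rs2: int, imm2: int) -> int:
    """T-Head-style lrb as described above: load one byte (bits 0..7)
    from address rs1 + (rs2 << imm2) and sign-extend it."""
    b = mem[rs1 + (rs2 << imm2)]
    return b - 0x100 if b & 0x80 else b  # sign-extend the byte
```

Here mem stands in for memory as a flat byte array, and imm2 scales the index by 1, 2, 4, or 8.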
On the topic of fusion:
Could there be a generic NOP instruction that provides two additional register read ports and one additional write port for fusing with a following instruction? I have added a custom instruction called ‘REG’ to my design for this purpose.
I also use fused constant extensions in my RISC-V design.
I added scaled indexed addressing as custom instructions under the opcode-13 space in my own RISC-V core. I have since decided to drop the scaling because my low-cost compiler could not use it a significant portion of the time; it ended up using shift and add instructions a lot of the time to scale the index. Given that the scaling was used only a fraction of a percent of the time, I figure it is not worth providing directly. To keep to the limit of two register reads in an instruction, I decided to use a prefix instruction to provide extra registers for indexed store operations, among other things. This amounts to a form of instruction fusing. I still think it would be a good idea to have indexed addressing represented as such in assembly language. Indexed load operations make sense to me, but stores require too many register reads to implement directly.
If you are going to have indexed addressing in the design, maybe we could keep the opcodes consistent?
I am using: |Rs3|Sc|~~|op3|Rs1|6|Rd|13| for indexed loads
and |Rs3|Sc|Rs2|Rs1|7|~~|op3|13| for indexed stores.
Sc, the scale, is not used. And stores will be using fused instructions.
>Curious. As noted, I was typically seeing scaled-index operations closer
>to around 15-20%.
Okay, I was seeing about 1 to 3% of total instructions using indexed addressing (or about 4 to 12% of memory ops). Of course, my dataset was limited to boot functions and a couple of small demos. Not really enough statistically to be very valid, but enough to indicate that the crowd numbers (numbers read from architecture texts, IIRC) were probably correct. This was more than the roughly 1% minimum I had for an instruction feature to qualify as a must-include. Of that 1%, only about ¼ effectively used index scaling. The issue is that the compiler would not evaluate the index expression in an order suitable for the optimizer to be able to use it. I could maybe improve the compiler in this regard, but it is not a simple task. It would have to evaluate expressions multiple times in multiple orders, then choose the best ordering.
I was looking at using generic fuse instructions (prefixes, basically) because then the core does not need to pattern match. As soon as it sees a generic prefix, it knows to fuse it into the next instruction. I need this functionality because there are a couple of instructions that require four register read ports or two write ports; FDP (fused dot product), for instance.
I don’t disagree Allen, but PowerPC and Arm both included Reg+Reg addressing. Add that to the wide variety of implementations in this thread, as well as RISCY’s, and I think there is justification for a standard extension which supports it.
Jeff
From: Allen Baum <allen...@esperantotech.com>
Sent: Tuesday, April 12, 2022 2:51 PM
To: Robert Finch <robf...@gmail.com>
Cc: RISC-V ISA Dev <isa...@groups.riscv.org>; Philipp Tomsich <philipp...@vrull.eu>; cr8...@gmail.com <cr8...@gmail.com>; jim.wil...@gmail.com <jim.wil...@gmail.com>
Subject: [EXT] Re: [isa-dev] Misc (Maybe controversial): Scaled Index Load/Store?
Hi Philipp,
No data to share unfortunately. I have never done fusion before, but the concept intrigues me. Any idea how many gates it takes to implement fusion like this compared to having a dedicated instruction? I assume the cost increases with issue width? Frequency impact?
Jeff
So nice to not have to worry about area and frequency! 😊
Thanks Edwin!
Taking a look…
Jeff
On Wednesday, 13 April 2022 12:53:48 PM NZST BGB wrote:
> Yeah, opcode fusion is one of those things that looks good in a
> theoretical / on-paper sense, but less good to someone who frequently
> battles with resource cost and timing constraints...
>
> Things like "if(A==B)" and "C=A+B;" are not really your friends when it
> comes to cost or timing (at least on an FPGA).
or "now your 6 load/store units need an extra register read port each" ....
Paul
Unfortunately, Chris never implemented fusion, at least as of the time he presented at the 2016 RISC-V Summit (available on the RISC-V International YouTube channel). Decode time in a processor is very important. With fusion and the C extension, decode now becomes a longer path, impacting your frequency. I don't think fusion is cheap from this point of view.
Jeff
From: Allen Baum <allen...@esperantotech.com>
Sent: Tuesday, April 12, 2022 9:10 PM
To: BGB <cr8...@gmail.com>
Cc: Robert Finch <robf...@gmail.com>; RISC-V ISA Dev <isa...@groups.riscv.org>; Philipp Tomsich <philipp...@vrull.eu>; jim.wil...@gmail.com <jim.wil...@gmail.com>
Subject: [EXT] Re: [isa-dev] Misc (Maybe controversial): Scaled Index Load/Store?
C-extension didn’t add much delay to the decode part.
But adding fusion did create a large delay (probably just my implementation).
Biggest issue is that we need to merge combinations of {rvc,rvc}, {rv,rvc}, {rvc,rv}, {rv,rv}.
That turned into a big delay, even if only looking at add immediates.
Richard
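For anyone following along, the pairing problem Richard describes comes from how instruction length is determined: an instruction is a 16-bit compressed (RVC) one iff its low two bits are not 0b11. A toy classifier for candidate fusion pairs (my own illustration, not from any implementation in this thread):

```python
def is_rvc(insn: int) -> bool:
    """A RISC-V instruction is 16-bit (compressed) iff bits [1:0] != 0b11."""
    return (insn & 0b11) != 0b11

def pair_kind(first: int, second: int) -> str:
    """Classify a candidate fusion pair by the lengths of its halves."""
    a = "rvc" if is_rvc(first) else "rv"
    b = "rvc" if is_rvc(second) else "rv"
    return "{%s,%s}" % (a, b)

# A fusing decoder must recognize all four shapes ({rvc,rvc}, {rvc,rv},
# {rv,rvc}, {rv,rv}), at both halfword alignments, which is where the
# decode-path delay comes from.
```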
Good data, Richard. For context, is your machine single-issue? The issue becomes bigger for superscalar implementations.
Jeff
It’s single-issue.
I think for machines with an elastic pipeline, like OoO superscalar machines, it should be possible to pipeline the pattern match / decode across multiple clock cycles. It would add to the latency of the instruction, but multiple decoders could be added to alleviate this.
Could there be a fusion prefix to indicate what type of fusion is taking place so that the instructions do not need to be decoded to find out?
Aligned or non-aligned doesn’t matter in my implementation.
First I need to fix a few other things, then I’ll go back to macro-fusion.
Richard
x86 is limited to cmp+br, though the br depends on eflags.
Andrew, do you have a reference or title for Celio's paper? It might be good for the general populace to take a look.
True enough, AFAIK. But high-performance ARMv8 processors are performing fusion based upon comparing register identifiers, including patterns analogous to some described in Celio's paper. See https://github.com/gcc-mirror/gcc/blob/f2ebf2d98efe0ac2314b58cf474f44cb8ebd5244/gcc/config/aarch64/aarch64.cc#L25258

Some vendors of high-performance RISC-V processors are doing so, too; see e.g. https://www.mail-archive.com/gcc-p...@gcc.gnu.org/msg274240.html
On Apr 13, 2022, at 12:52 PM, Andrew Waterman <and...@sifive.com> wrote:
I think fusion is less painful for higher-performance cores where there is already a deep pipeline. At the lower end, I think it will affect frequency unless additional pipeline stages are added; that may be feasible, but it is certainly not desirable compared to simply adding a new opcode.
Jeff
I.e., RISC-V has matured. Real-world things now place constraints that weren't there when it was still a research project. The idea of extensions is fun to think about, but the commercial reality is that such extensions are niche curiosities in terms of the number of users. Implementing any extensions outside of IMAFDC means the provider of the silicon has to compile all OS distributions their customers want, and customers are forced to recompile everything themselves (libraries, dependencies, etc.) if they want the benefit of the additional extensions; not many customers are open to this. For niche products, and embedded products that build all their own binaries top to bottom, it can work. For large-scale shipment, it's not happening outside of fantasy stories.
Which means that we need to be very very careful when considering any instructions that may make their way into an extension that is used widely in the eco-system (such as embedded into binaries fetched by yum or apt-get..). Those have a high commercial impact (and we need a very high bar in terms of voting in order to accept such a change).
On Apr 13, 2022, at 2:34 PM, Sean Halle <sean...@gmail.com> wrote:
> Fair enough.. the starting point may have had a bit of a head start.. but in the interim.. comparing that starting point to v8..?

Is v8 the desired end point?
(That's actually a genuine question.. is there a desire out there to end up with an instruction set of similar size and complexity?)
Yup, that’s where I handle them, right at the IF stage (first pipeline stage).
Still, the many combinations are a problem (again, most likely an issue with my implementation/architecture).
It was stated before that {rvc,rvc} is the low-hanging fruit, but looking at the code that gets compiled, there are very few of those. The most obvious ones are {rv,rv} {auipc,addi}, where there is no RVC encoding for auipc and the RVC addi doesn’t have enough range. (We run mostly embedded/FreeRTOS on our CPU.)
Other obvious ones are the lui/add(i), add(i)/load, and add(i)/store combinations, but here many versions are generated, forcing {rvc,rvc}, {rv,rvc}, and {rv,rv} to be supported.
And finally, there are no RVC lb/sb/lh/sh versions, which are very popular in embedded apps.
Richard
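As a concrete instance of the lui/add(i) pairing mentioned above: the pair materializes a 32-bit constant, and because addi sign-extends its 12-bit immediate, the lui half must compensate. A sketch of that split (illustrative RV32 semantics; function names are my own):

```python
def split_const(value: int):
    """Split a 32-bit constant into (lui_hi20, addi_lo12) such that
    (hi20 << 12) + sign_extend(lo12) reproduces the value."""
    lo12 = value & 0xFFF
    if lo12 >= 0x800:        # addi sign-extends, so bump hi20 to compensate
        lo12 -= 0x1000
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12

def materialize(hi20: int, lo12: int) -> int:
    """Model of the fused pair: lui rd, hi20 ; addi rd, rd, lo12 (RV32)."""
    return ((hi20 << 12) + lo12) & 0xFFFFFFFF
```

This is the sort of combination a fusing decoder would have to recognize across the {rvc,rvc}/{rv,rvc}/{rv,rv} encodings Richard mentions.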
From: Greg Favor <gfa...@ventanamicro.com>
Date: Wednesday, 13 April 2022 at 21:59
To: Andrew Waterman <and...@sifive.com>
Richard, Zce gives the lb/sb/lh/sh versions. I know there were many spec changes going on for Zce; I'm not sure if the dust has settled yet.
Jeff
From: Richard Herveille <richard....@roalogic.com>
Sent: Thursday, April 14, 2022 1:58 AM
To: Greg Favor <gfa...@ventanamicro.com>; Andrew Waterman <and...@sifive.com>
Cc: Edwin Sutanto <esut...@gmail.com>; Allen Baum <allen...@esperantotech.com>; Jeff Scott <jeff....@nxp.com>; BGB <cr8...@gmail.com>; Robert Finch <robf...@gmail.com>; RISC-V ISA Dev <isa...@groups.riscv.org>; Philipp Tomsich <philipp...@vrull.eu>;
jim.wil...@gmail.com <jim.wil...@gmail.com>; Richard Herveille <richard....@roalogic.com>
Subject: Re: [EXT] Re: [isa-dev] Misc (Maybe controversial): Scaled Index Load/Store?