Feedback from IAR on the Zicond RISC-V extension


Anders Lindgren

May 9, 2023, 3:03:48 AM
to isa...@groups.riscv.org

Hi!

This is feedback from IAR on the proposed Zicond extension.

Summary: With Zicond, we see no gain in code size and (if the hardware
supports fused instructions) no gain in speed either. If conditional
instructions are of interest for RISC-V, we believe that a "conditional
move" would be more useful, as it can also be used as a "conditional
clear" when "x0" is used as the source register.



* Avoid small extensions (general feedback)

Lately, the number of RISC-V extensions has exploded. This allows
designers of hardware to pick any from a large smorgasbord of
extensions.

Unfortunately, this puts tool vendors in a horrible situation.
Currently, runtime libraries are provided in a number of flavors.
However, each additional extension puts tool vendors in a situation
where they have to choose whether to increase the number of libraries
or provide suboptimal libraries for some devices.

In fact, this doesn't just apply to tool vendors, it applies to
anybody who will provide prebuilt libraries or executable files.

A better solution would be to group related instructions into larger
extensions and to enforce that devices implement all instructions in
the selected extensions.

This is not the first time I have provided this feedback to the RISC-V
community, so hopefully it will not come as a surprise.


* No gain in code size (when Rd = Rs1)

If the compressed extension is present, there is no gain in code size
for the czero instructions, when Rd and Rs1 are the same register.

For example:

    czero.nez a0, a0, a1     -- 4 bytes.

Compared to:

        c.beqz a1, skip      -- 2 bytes.
        c.li   a0, 0         -- 2 bytes.
    skip:

Of course, this doesn't apply when Rd and Rs1 are different registers.
However, modern compilers are very good at register allocation, so
this limitation will have very little impact in the real world.
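The equivalence claimed above can be checked with a small model. This is only a sketch: registers are treated as 64-bit unsigned integers, the function names are made up for illustration, and the byte counts assume 4 bytes for a Zicond instruction and 2 bytes for each compressed instruction.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def czero_eqz(rs1, rs2):
    """czero.eqz rd, rs1, rs2: rd = 0 if rs2 == 0, otherwise rd = rs1."""
    return 0 if rs2 == 0 else rs1 & MASK


def czero_nez(rs1, rs2):
    """czero.nez rd, rs1, rs2: rd = 0 if rs2 != 0, otherwise rd = rs1."""
    return 0 if rs2 != 0 else rs1 & MASK


def branch_over_clear(a0, a1):
    """Models the branch sequence: skip the clear of a0 when a1 is zero."""
    if a1 != 0:  # branch not taken, so the clear executes
        a0 = 0
    return a0 & MASK


# Assumed encoded sizes in bytes.
SIZE_CZERO = 4            # one 32-bit Zicond instruction
SIZE_BRANCH_SEQUENCE = 2 + 2  # compressed branch plus compressed clear

for a0 in (0, 1, 0x1234, MASK):
    for a1 in (0, 1, MASK):
        # Rd == Rs1 case: both forms leave the same value in a0.
        assert czero_nez(a0, a1) == branch_over_clear(a0, a1)
```

The model says nothing about timing or branch prediction; it only shows that the two forms compute the same value in the same number of bytes when Rd and Rs1 coincide.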


* No gain in speed (when Rd = Rs1)

If this is considered a common pattern, hardware can easily decode the
two instructions as one larger operation and run it without breaking
the pipeline. This is known as "instruction fusing".

In fact, if I understand correctly, there is RISC-V hardware that
already does this for all conditional branches over a single
instruction.


* The examples

I noticed that the provided examples can be translated to smaller code
when the czero extension isn't used!

For example:

    (rc == 0) ? (rs1 + rs2) : rs1

The czero specification suggested using the 6 byte sequence:

    czero.nez rd, rs2, rc
    add rd, rs1, rd               == c.add rd, rs1

However, when the result should be placed in the same register as Rs1,
only 4 bytes are needed:

         c.bnez   a2, skip
         c.add    a0, a1
    skip:

When Rd and Rs1 are different registers, 6 bytes are sufficient:

         c.bnez   a2, skip
         c.add    a1, a0
    skip:
         c.mv     a0, a1


For the bitwise "and" example, "(rc == 0) ? (rs1 & rs2) : rs1", the
situation is even worse. The specification suggests using an 8 or 10
byte sequence, whereas code that doesn't use the czero extension uses 4
or 6 bytes.

Hence, the "motivating examples" provided by the czero specification
fail to provide proof that this extension provides any real benefits.
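As a sanity check on the two examples, the following sketch models the czero-based sequences next to the branch-based ones and confirms they produce the same values. The conditional-add sequence is the one quoted above; the three-instruction conditional-and sequence (and, czero.eqz, or) is an assumption about what the specification intends, since it is not reproduced in this message. All function names are illustrative.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def czero_eqz(value, cond):
    return 0 if cond == 0 else value


def czero_nez(value, cond):
    return 0 if cond != 0 else value


def cond_add_czero(rs1, rs2, rc):
    # czero.nez rd, rs2, rc ; add rd, rs1, rd
    rd = czero_nez(rs2, rc)
    return (rs1 + rd) & MASK


def cond_add_branch(rs1, rs2, rc):
    # Branch over the add when rc != 0.
    if rc == 0:
        rs1 = (rs1 + rs2) & MASK
    return rs1


def cond_and_czero(rs1, rs2, rc):
    # Assumed sequence: and rd, rs1, rs2 ; czero.eqz rtmp, rs1, rc ; or rd, rd, rtmp
    rd = rs1 & rs2
    rtmp = czero_eqz(rs1, rc)
    return rd | rtmp


def cond_and_branch(rs1, rs2, rc):
    # Branch over the and when rc != 0.
    if rc == 0:
        rs1 &= rs2
    return rs1


values = (0, 1, 0x55, 0xF0F0, MASK)
for rs1 in values:
    for rs2 in values:
        for rc in (0, 1, MASK):
            assert cond_add_czero(rs1, rs2, rc) == cond_add_branch(rs1, rs2, rc)
            assert cond_and_czero(rs1, rs2, rc) == cond_and_branch(rs1, rs2, rc)
```

The conditional-and case works because OR-ing rs1 back into (rs1 & rs2) restores rs1 exactly when the condition asks for the unmodified value.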


* An alternative: Conditional move

If branch free conditional instructions are of interest to RISC-V,
maybe we should consider a conditional move instead, for example:

    mv.eqz  rd, rs1, rc

This would have the following semantics:

    if rc == 0 then rd = rs1

This could use the same encoding space as the proposed czero
instructions.

When using x0 as rs1, this would behave almost like czero, with the
exception that it doesn't contain an implicit move.

    mv.eqz    rd, x0, rc     // Conditionally clear RD.

For example, conditional select (from the czero specification):

    rd = (rc == 0) ? rs1 : rs2

With czero (10 bytes, need temp register):

    czero.nez rd, rs1, rc
    czero.eqz rtmp, rs2, rc
    or        rd, rd, rtmp

With mv.cond (8 bytes, doesn't need temp register):

    mv.eqz  rd, rs1, rc
    mv.nez  rd, rs2, rc
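The two select sequences can be compared with a short model. This is a sketch under the semantics stated in this message; mv.eqz and mv.nez are the proposed (not ratified) instructions, and the Python names are illustrative. The old value of rd is passed explicitly because a conditional move leaves rd unchanged when the condition fails.

```python
def czero_eqz(value, cond):
    return 0 if cond == 0 else value


def czero_nez(value, cond):
    return 0 if cond != 0 else value


def mv_eqz(old_rd, rs1, rc):
    """Proposed mv.eqz: rd = rs1 if rc == 0, otherwise rd keeps its old value."""
    return rs1 if rc == 0 else old_rd


def mv_nez(old_rd, rs1, rc):
    """Proposed mv.nez: rd = rs1 if rc != 0, otherwise rd keeps its old value."""
    return rs1 if rc != 0 else old_rd


def select_czero(rs1, rs2, rc):
    # czero.nez rd, rs1, rc ; czero.eqz rtmp, rs2, rc ; or rd, rd, rtmp
    rd = czero_nez(rs1, rc)
    rtmp = czero_eqz(rs2, rc)
    return rd | rtmp


def select_mv(rs1, rs2, rc, old_rd=0):
    # mv.eqz rd, rs1, rc ; mv.nez rd, rs2, rc
    rd = mv_eqz(old_rd, rs1, rc)
    rd = mv_nez(rd, rs2, rc)
    return rd


for rs1 in (0, 7, 0xABCD):
    for rs2 in (0, 9, 0x1234):
        for rc in (0, 1, 42):
            expected = rs1 if rc == 0 else rs2
            assert select_czero(rs1, rs2, rc) == expected
            # The result does not depend on what rd held beforehand.
            assert select_mv(rs1, rs2, rc, old_rd=0xDEAD) == expected
```

The model assumes rd is a different register from rc. If they were the same register, the first move could overwrite the condition before the second move tests it.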

Sincerely,
    Anders Lindgren, Lead Engineer for the IAR compiler for RISC-V

--

Anders Lindgren
Lead engineer of the IAR compiler for RISC-V
E-mail: anders....@iar.com

IAR Systems AB
Box 23051, Strandbodgatan 1
SE-750 23 Uppsala, Sweden
www.iar.com

Tommy Thorn

May 9, 2023, 3:35:48 AM
to Anders Lindgren, isa...@groups.riscv.org
I’m equally concerned about the explosion of extension proposals and the underrepresentation of high-end superscalar core developers.

Your proposal: “This would have the following semantic:

    if rc == 0 then rd = rs1”

is a good illustration of the fallacy of in-order thinking. For any core with renaming, the above is a three-input instruction (rc, rs1, and the old value of rd), and it would be the only integer instruction with three inputs. That’s a non-starter.
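A toy model of a rename stage makes the operand count concrete. This is a deliberately simplified sketch, not a description of any real core: the rename table, the physical register names and the `rename` function are all hypothetical. The point is only that an instruction which may leave rd unchanged has to read the previous mapping of rd, because the newly allocated physical register must end up holding some value either way.

```python
# Proposed operations that may leave rd unchanged (hypothetical mnemonics).
CONDITIONAL_WRITES = {"mv.eqz", "mv.nez"}


def rename(op, rd, rs1, rs2, rat, free_list):
    """Return the physical registers the renamed instruction must read."""
    sources = [rat[rs1], rat[rs2]]
    if op in CONDITIONAL_WRITES:
        # When the condition fails, the new physical register must receive
        # the old value of rd, so the old mapping becomes a third source.
        sources.append(rat[rd])
    rat[rd] = free_list.pop(0)  # allocate a fresh destination register
    return sources


def fresh_state():
    rat = {"a0": "p10", "a1": "p11", "a2": "p12"}
    return rat, ["p40", "p41"]


rat, free = fresh_state()
czero_sources = rename("czero.eqz", "a0", "a1", "a2", rat, free)

rat, free = fresh_state()
mv_sources = rename("mv.eqz", "a0", "a1", "a2", rat, free)

print(len(czero_sources), len(mv_sources))
```

In this model czero.eqz reads two physical registers while mv.eqz reads three, even though both encode only three register fields.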

Tommy

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/adbad487-e265-0844-4499-02e953f41dab%40iar.com.

Tommy Murphy

May 9, 2023, 4:55:42 AM
to Tommy Thorn, Anders Lindgren, isa...@groups.riscv.org
> A better solution would be to group related instructions
> into larger extensions and to enforce that devices
> implement all instructions in the selected extensions.

Isn't that effectively what Profiles are aiming to address?



Bruce Hoult

May 9, 2023, 9:45:25 AM
to Anders Lindgren, isa...@groups.riscv.org
I believe Zicond is not intended to provide code size savings or execution time improvements compared to completely unrestricted code, but to *most importantly* make execution time predictable and independent of the data values, and smaller and faster than the currently possible constant-time sequences.

While some microarchitectures can pair a conditional branch over one instruction with that instruction, not all do, and I suspect even on those that do this may be subject to vagaries of instruction scheduling and whether those two instructions arrive as a pair i.e. with the branch (which would normally end a basic block) in the first issue slot of an instruction pair. In short, it can't be relied on.

Conditional move was considered in the B extension, and ruled out because it would be the only integer instruction that requires reading from three registers. If that path was gone down -- which would be a huge departure from current practice --  then there are a number of other instructions that could be considered (see the B extension working documents), but they are all pretty rare, and 3-input instructions chew up huge amounts of opcode space (and full size RISC-V integer instructions do not have implicit inputs).

Zicond is about providing constant execution time without 3-input instructions, while also being a little bit more compact than the current constant execution option of using slt/sltu to produce a 0 or 1, then neg or addi -1 to produce an all-zero or all-one mask (which I think slt/sltu should have done in the first place, but never mind), then AND/ANDN between the mask and two inputs, then OR to combine them.
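The existing branch-free option described above can be sketched as follows. This models one possible instruction choice (sltu against x0, neg, and, andn, or) on 64-bit values; the function names are illustrative and the model says nothing about actual timing on any implementation.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def sltu(a, b):
    """sltu rd, a, b: 1 if a < b (unsigned), else 0."""
    return 1 if (a & MASK) < (b & MASK) else 0


def neg(x):
    """neg rd, x: two's complement negation, turning 1 into all ones."""
    return (-x) & MASK


def andn(a, b):
    """andn rd, a, b: a AND NOT b."""
    return a & ~b & MASK


def select_by_mask(x, y, rc):
    """Return x if rc != 0 else y, with no branch."""
    t = sltu(0, rc)      # 1 if rc != 0, else 0
    m = neg(t)           # all ones if rc != 0, else all zeros
    keep_x = x & m       # x when the mask is set
    keep_y = andn(y, m)  # y when the mask is clear
    return keep_x | keep_y


for x in (0, 5, MASK):
    for y in (0, 9, 0x1234):
        for rc in (0, 1, MASK):
            assert select_by_mask(x, y, rc) == (x if rc != 0 else y)
```

The select above takes five operations in this model, against three for the czero.nez / czero.eqz / or sequence.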

As for the small extensions and library explosion situation, I think there are two normal situations:

- applications processors running shrink-wrapped software and libraries. Extensions are gathered into strictly nesting profiles, reducing the library explosion from O(2^N) to O(N/k).

- embedded systems. Some people incorporate an off-the-shelf microcontroller into their project, others license a core with a known (and optimised for their task) set of optional extensions. In either case you know in advance exactly what extensions you have and compile your crt and libc and other libraries to use exactly that set of extensions.
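The difference between the two growth rates in the first bullet can be made concrete with a little arithmetic. The numbers below are purely hypothetical (12 optional extensions, grouped 4 per profile), and counting the bare base ISA as one extra library variant is an assumption of this sketch.

```python
def variants_independent(n):
    """Every subset of n independent optional extensions is a distinct target."""
    return 2 ** n


def variants_nested(n, k):
    """Strictly nesting profiles, each adding k extensions, plus the base ISA."""
    profiles = -(-n // k)  # ceiling division
    return profiles + 1


N = 12  # hypothetical number of optional extensions
K = 4   # hypothetical number of extensions added per profile

print(variants_independent(N), variants_nested(N, K))
```

With these example numbers a vendor would face 4096 possible library builds for free combinations, but only 4 when targets are restricted to nested profiles.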

If the libraries from IAR are closed source (even to paying customers) then perhaps you could consider a service where customers told you exactly what extensions their CPU core has, and IAR builds a custom set of libraries (and maybe compiler) for them?

I expect such a request could be fulfilled within hours, or even minutes with an automated system.

You could of course cache the generated libraries in case another customer asked for the same combination in future.

--

L Peter Deutsch

May 9, 2023, 10:22:52 AM
to isa...@groups.riscv.org
> Conditional move was considered in the B extension, and ruled out because it
> would be the only integer instruction that requires reading from three
> registers. ... 3-input instructions chew up huge amounts of opcode space

Anders Lindgren's proposed

if rc == 0 then rd = rs1

does not require encoding 3 input registers in the instruction; but as Tommy
Thorn pointed out, in implementations with register renaming, it does
require reading from 3 registers.

I too am concerned about the proliferation of small extensions, especially
because of what I see as an increasing number of interactions between them
(most recently Zfa / Z*inx and Zkb* / Zb*), in their impact on both hardware
and compilers, not to mention runtime libraries distributed in binary form.
I don't have any suggestions other than to advocate restraint by proposers
and setting a high bar in Architectural Review. Profiles can certainly
mitigate the issue, but I think they have an underlying assumption that
different application situations naturally call for different *clusters* of
extensions: I don't have the experience in the hardware / systems world to
evaluate that, especially for extensions like Zfa or Zb* that seem pretty
application-independent to me.

I concur with Bruce Hoult's point about constant-time conditional selection.
But without hard data, I question whether the saving in either code space or
execution time is worth the conceptual cost of yet another small extension.
And if one is going to be considered, I would favor Hoult's suggestion of

rd = (rs1 <relation> rs2 ? -1 : 0)

(relation being ne or eq, lt, and ltu), which can be fused with a following
AND/ANDN and has more uses than the current Zicond proposal. But I'm not at
all sure that this carries its own weight either.
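To illustrate how a mask-producing compare would subsume the Zicond operations, here is a small sketch. The `set_mask_ne` function stands for a hypothetical instruction of the form suggested above; it is not part of any ratified extension, and the other names are illustrative as well.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def set_mask_ne(rs1, rs2):
    """Hypothetical instruction: rd = (rs1 != rs2) ? -1 : 0."""
    return MASK if rs1 != rs2 else 0


def czero_eqz_via_mask(value, cond):
    m = set_mask_ne(cond, 0)  # all ones when cond != 0
    return value & m          # and


def czero_nez_via_mask(value, cond):
    m = set_mask_ne(cond, 0)
    return value & ~m & MASK  # andn


for value in (0, 7, MASK):
    for cond in (0, 1, MASK):
        assert czero_eqz_via_mask(value, cond) == (0 if cond == 0 else value)
        assert czero_nez_via_mask(value, cond) == (0 if cond != 0 else value)
```

Each Zicond operation becomes two instructions in this scheme, which is why fusing the compare with the following AND/ANDN matters for the comparison.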

--

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC

Philipp Tomsich

May 9, 2023, 10:52:22 AM
to L Peter Deutsch, isa...@groups.riscv.org
On Tue, 9 May 2023 at 16:22, L Peter Deutsch <gh...@major2nd.com> wrote:
>
> > Conditional move was considered in the B extension, and ruled out because it
> > would be the only integer instruction that requires reading from three
> > registers. ... 3-input instructions chew up huge amounts of opcode space
>
> Anders Lindgren's proposed
>
> if rc == 0 then rd = rs1
>
> does not require encoding 3 input registers in the instruction; but as Tommy
> Thorn pointed out, in implementations with register renaming, it does
> require reading from 3 registers.
>
> I too am concerned about the proliferation of small extensions, especially
> because of what I see as an increasing number of interactions between them
> (most recently Zfa / Z*inx and Zkb* / Zb*), in their impact on both hardware

Note that Zfa does not interact with Z{fd}inx: it explicitly requires
(any combination) of F, D, Zfh, or Q.


L Peter Deutsch

May 9, 2023, 11:20:46 AM
to Philipp Tomsich, isa...@groups.riscv.org
> Note that Zfa does not interact with Z{fd}inx: it explicitly requires
> (any combination) of F, D, Zfh, or Q.

The Z*inx documentation says:

"In general, software that assumes the presence of the F extension is
incompatible with software that assumes the presence of the Zfinx extension,
and vice versa."

Zfa is not currently compatible with Z*inx, because Zfa as specified uses
the floating-point registers. The two are compatible functionally in
principle, but making them compatible in actuality would require an explicit
statement in the Zfa specification. This is an interaction, isn't it?

Anders Lindgren

May 9, 2023, 11:41:44 AM
to Bruce Hoult, isa...@groups.riscv.org
On 2023-05-09 at 15:45, Bruce Hoult wrote:
> If the libraries from IAR are closed source (even to paying customers) then perhaps you could consider a service where customers told you exactly what extensions their CPU core has, and IAR builds a custom set of libraries (and maybe compiler) for them?
>
> I expect such a request could be fulfilled within hours, or even minutes with an automated system.
>
> You could of course cache the generated libraries in case another customer asked for the same combination in future.

IAR provides everything needed for customers to rebuild the libraries, including source files and IAR Embedded Workbench projects. You can enable any extension, change optimization settings etc. so for IAR this isn't a big problem.

However, hopefully there will be many other organizations providing prebuilt RISC-V libraries and applications in the future, and a diverging set of extensions will make this more difficult.

Jeff Scott

May 9, 2023, 11:48:42 AM
to Tommy Thorn, Anders Lindgren, isa...@groups.riscv.org

Tommy,

Why is a 3 input extension a non-starter? Many other common architectures support 3 input instructions. This is an extension that is optional. If you don’t want to support it because it has 3 sources, you don’t have to. If you support F you already have the HW support for it. What is the driving decision to say it is a non-starter? Opcode space?

Jeff

BGB

May 9, 2023, 1:20:04 PM
to isa...@groups.riscv.org
On 5/9/2023 8:45 AM, Bruce Hoult wrote:
> I believe Zicond is not intended to provide code size savings or
> execution time improvements compared to completely unrestricted code,
> but to *most importantly* make execution time predictable and
> independent of the data values, and smaller and faster than the
> currently possible constant-time sequences.
>
> While some microarchitectures can pair a conditional branch over one
> instruction with that instruction, not all do, and I suspect even on
> those that do this may be subject to vagaries of instruction scheduling
> and whether those two instructions arrive as a pair i.e. with the branch
> (which would normally end a basic block) in the first issue slot of an
> instruction pair. In short, it can't be relied on.
>
> Conditional move was considered in the B extension, and ruled out
> because it would be the only integer instruction that requires reading
> from three registers. If that path was gone down -- which would be a
> huge departure from current practice --  then there are a number of
> other instructions that could be considered (see the B extension working
> documents), but they are all pretty rare, and 3-input instructions chew
> up huge amounts of opcode space (and full size RISC-V integer
> instructions do not have implicit inputs).
>

Sometimes it is more reasonable to allow 3-input instructions to exist.
As for encoding, one can use the destination register as both a source
and destination (no need to use a 4-register encoding).

Yeah, maybe also indexed Load/Store would be nice as well, while we are
at it.


The 3-input ops could be handled in a mostly 2-input pipeline by using
two lanes for a single op in these cases (assuming a core with multiple
execute lanes). Though it may make sense to omit them for cost saving on
cores which only have a single lane (this is partly why, in my ISA
design, I ended up making things like the Indexed Load/Store encodings
optional on smaller profiles; this can make it a little easier to fit
the core into something like an XC7S25 or similar).

So, while the pipeline is nominally 3 wide, any 3 input ops immediately
drop it to effectively 2-wide. Similarly, 128-bit SIMD ops drop the
pipeline to scalar operation (with all 6 read ports being used as a
virtual 128-bit 3R1W register-file, vs the usual 6R3W).


But, I have de-emphasized the "smaller end" use-cases, as practically
these have little real advantage over RISC-V (with both having ended up
being "functionally similar" in this case; and RV ending up a little ahead
in terms of performance and similar due to the "GCC's code generation
doesn't suck" advantage).


Well, and on the bigger end, have ended up adding another mode which
uses a variant of my ISA's encoding scheme (*1), but uses the RV mode's
register layout, and will (presumably) use the same C ABI as normal
RISC-V. This being mostly to allow direct bidirectional function-pointer
calls between these ISA modes without needing to use thunks.

Where: R0..R31 identity-map to X0..X31, and R32..R63 will probably map
to F0..F31 assuming I add F/D support; but the FPU ops aren't exactly a
1:1 mapping. But, then again, if I add F/D (and could then probably
switch from RV64IM to RV64G, with 'A' being 'sorta').


*1: It is derived from an 'XG2' sub-mode, which does 32/64/96 bit
instructions (with some 33 and 64-bit immediate encodings), 3-wide VLIW
bundles, and uses 6-bit register fields (with 64 GPRs). Ended up calling
the new mode 'XG2RV'.

Was easy enough to add to my Verilog code, but compiler support is still
in the "TODO" category (mostly need to make it be able to handle the
different ABI rules and similar; as the C ABI's are somewhat different;
and probably a need to add the ability to generate code for the RISC-V
ISA as well, ...). TBD if any of this will be useful.


> Zicond is about providing constant execution time without 3-input
> instructions, while also being a little bit more compact than the
> current constant execution option of using slt/sltu to produce a 0 or 1,
> then neg or addi -1 to produce an all-zero or all-one mask (which I
> think slt/sltu should have done in the first place, but never mind),
> then AND/ANDN between the mask and two inputs, then OR to combine them.
>
> As for the small extensions and library explosion situation, I think
> there are two normal situations:
>
> - applications processors running shrink-wrapped software and libraries.
> Extensions are gathered into strictly nesting profiles, reducing the
> library explosion from O(2^N) to O(N/k).
>
> - embedded systems. Some people incorporate an off-the-shelf
> microcontroller into their project, others license a core with a known
> (and optimised for their task) set of optional extensions. In either
> case you know in advance exactly what extensions you have and compile
> your crt and libc and other libraries to use exactly that set of extensions.
>
> If the libraries from IAR are closed source (even to paying customers)
> then perhaps you could consider a service where customers told you
> exactly what extensions their CPU core has, and IAR builds a custom set
> of libraries (and maybe compiler) for them?
>
> I expect such a request could be fulfilled within hours, or even minutes
> with an automated system.
>
> You could of course cache the generated libraries in case another
> customer asked for the same combination in future.
>

Yeah, ISA fragmentation concerns are a pain.

My assumption had been mostly that one can tell the compiler which
options to have enabled, but yeah, one does need to build the libraries
for the specific target configuration one wants to use.

Not ideal for "PC like" use-cases, but probably easily workable for
embedded systems (where, usually, one builds directly for whatever chip
they happen to be using).

kr...@sifive.com

May 9, 2023, 8:55:39 PM
to Bruce Hoult, Anders Lindgren, isa...@groups.riscv.org

Zicond does _not_ provide data-independent timing in general, only on
rs1 input.

This choice was discussed extensively with crypto group, who did not
see the value in providing DVI versions, above and beyond existing Zkt
support. The current non-DVI definition supports uarch optimizations
when condition is known before data is ready.

The main performance effect is to avoid branches and resulting
pollution of branch-predictor tables, especially on
difficult-to-predict branches.

Krste

kr...@sifive.com

May 9, 2023, 9:02:59 PM
to Jeff Scott, Tommy Thorn, Anders Lindgren, isa...@groups.riscv.org

>>>>> On Tue, 9 May 2023 15:48:35 +0000, Jeff Scott <jeff....@nxp.com> said:
| Tommy,
| Why is a 3 input extension a non-starter? Many other common architectures support 3 input instructions. This is an extension that is
| optional. If you don’t want to support it because it has 3 sources, you don’t have to. If you support F you already have the HW support for
| it. What is the driving decision to say it is a non-starter? Opcode space?

High-performance superscalar OoO implementations can be built better
(faster/wider/smaller/lower-power) when ISA doesn't have 3-input
integer instructions to consider.

IEEE-754 mandates 3-input FMA, but integer and FP are usually handled
in different units. Integer issue tends to be wider and more
latency-critical than FP issue.

Krste

Bruce Hoult

May 9, 2023, 11:35:00 PM
to Jeff Scott, Tommy Thorn, Anders Lindgren, isa...@groups.riscv.org
> If you support F you already have the HW support for it.

In the FP pipeline and register file, not the integer ones.

In floating point, 3-input FMA and friends are probably the most common instruction executed. As such it is well worth building an FP pipeline that can read three registers per instruction.

In integer, 3-input instructions (on machines that have them) are rather uncommon, individually and collectively. I guess by far the most common would be store with base + {scaled} index addressing, which in current RISC-V needs shNadd+store. It is much less clear that there is a benefit to investing in the extra hardware to support this than in the FP case.
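The shNadd + store split mentioned above can be sketched like this. The model uses sh2add for 4-byte elements and a plain dictionary as memory; the function names and the example addresses are illustrative only.

```python
XLEN = 64
MASK = (1 << XLEN) - 1


def sh2add(rs1, rs2):
    """sh2add rd, rs1, rs2: rd = rs2 + (rs1 << 2)."""
    return (rs2 + (rs1 << 2)) & MASK


def store_word_indexed(mem, base, index, value):
    """Store a 32-bit value to base[index] using two 2-input operations."""
    addr = sh2add(index, base)      # reads two registers: index and base
    mem[addr] = value & 0xFFFFFFFF  # sw reads two registers: address and data
    return addr


memory = {}
address = store_word_indexed(memory, base=0x1000, index=3, value=42)
print(hex(address), memory[address])
```

A single indexed store would have to read base, index and data in one instruction, whereas each step here reads at most two registers.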

If the ISA supports 3-input instructions but the implementation breaks them into multiple µops then that loses a lot of the point of them.

If a lot of people want to implement 3-input integer operations in their cores then it makes sense to standardise the instructions, semantics, opcode allocation in an optional extension so everyone does it the same.

But is it core implementers who are pulling for 3-input operations, or just programmers saying "wouldn't it be nice..."?

BGB

May 10, 2023, 2:36:21 AM
to isa...@groups.riscv.org
On 5/9/2023 8:02 PM, krste via RISC-V ISA Dev wrote:
>
>>>>>> On Tue, 9 May 2023 15:48:35 +0000, Jeff Scott <jeff....@nxp.com> said:
> | Tommy,
> | Why is a 3 input extension a non-starter? Many other common architectures support 3 input instructions. This is an extension that is
> | optional. If you don’t want to support it because it has 3 sources, you don’t have to. If you support F you already have the HW support for
> | it. What is the driving decision to say it is a non-starter? Opcode space?
>
> High-performance superscalar OoO implementations can be built better
> (faster/wider/smaller/lower-power) when ISA doesn't have 3-input
> integer instructions to consider.
>
> IEEE-754 mandates 3-input FMA, but integer and FP are usually handled
> in different units. Integer issue tends to be wider and more
> latency-critical than FP issue.
>

It is possibly a different perspective from someone with a strictly
in-order core that already has some 3-input instructions and handles
floating-point and SIMD stuff as a part of the integer pipeline...

Jeff Scott

May 10, 2023, 10:24:00 AM
to kr...@sifive.com, Tommy Thorn, Anders Lindgren, isa...@groups.riscv.org
Thanks Krste. If HPC finds it difficult to support an optional extension with 3 source operands, it doesn't have to. For cases where 3 source operands do make sense for in order cores we should be open to it rather than state it is a non-starter. Some may argue Zcmp and Zcmt are difficult to support in HPC (even in order cores for that matter), but we still have them.

Jeff

Jeff Scott

May 10, 2023, 10:29:54 AM
to Bruce Hoult, Tommy Thorn, Anders Lindgren, isa...@groups.riscv.org

Hi Bruce,

I come from a Zfinx world, so the HW to support 3 source ops is the same regardless of integer or float. My only point is 3 source should not be a “non-starter”. In cases where it makes sense, we should absolutely do it. In cases where it doesn’t make sense, we absolutely should not do it.

Jeff

Jeff Scott

May 10, 2023, 10:39:18 AM
to BGB, isa...@groups.riscv.org
Bingo!

Jeff

Tariq Kurd

unread,
May 10, 2023, 11:17:05 AM5/10/23
to Jeff Scott, BGB, isa...@groups.riscv.org
Yes - you beat me to pointing out that Zfinx already has this - so any Zfinx core already has to handle hazards on all 3 source operands.
And it's also true that Zcmp/Zcmt will be harder to implement on higher performance cores too.

(all of my extensions coming up here!)

I wonder if we could do something along the lines of Zcmp/Zcmt with the overlapping encodings which are already in use and allocate a couple of 16-bit encodings for this purpose. 
That would help the code-size and remove the short branches, and potentially allow 3 inputs too.

this could be a future idea for code-size (I'm gathering ideas at the moment)

Tariq







Jeff Scott

unread,
May 10, 2023, 11:24:49 AM5/10/23
to Tariq Kurd, BGB, isa...@groups.riscv.org

Also, PowerPC base ISA had 3 source operands and there were many OoO implementations of it.  So, it may be more of a headache, but can be done.

 

Jeff

 


Tariq Kurd

unread,
May 10, 2023, 11:31:08 AM5/10/23
to Jeff Scott, BGB, isa...@groups.riscv.org
I don't think anyone doubts that it can be done - but the machine gets bigger/slower/more complex as a result, so if you're paying that overhead you need to make sure that the benefit is there.
For a Zfinx machine the overhead has already been paid for, so it's a much easier choice.

Tariq

Tommy Thorn

unread,
May 10, 2023, 11:31:43 AM5/10/23
to Jeff Scott, Tariq Kurd, BGB, isa...@groups.riscv.org
Of course it can be done, but nothing is free and the question shouldn’t be “can’t we do it” but “does this improve performance across the board”.  The fact is that there are extremely few *integer* instructions that benefit from three sources - a tautology on current RISC-V as there are none.

If the aim is to reach parity with, or even exceed, the performance of Arm (and x86?), then we shouldn’t put roadblocks in the way.  Rename ports and register ports are a premium resource that is highly optimized for.  The timing and area cost of renaming is roughly the square of the number of sources and destinations - grow that and you either have to accept a lower frequency or a longer frontend/higher branch misprediction penalty, both performance-reducing options.  And for what?  Where is the data that proves that this addition would more than make up for the penalty?

Tommy
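
To put rough numbers on the renaming-cost argument above, here is a minimal sketch of one simple model (an assumption for illustration, not something given in the thread): within a rename group, every source of an instruction is compared against the destination of every older instruction in the same group. The function name and the one-destination-per-instruction default are hypothetical.

```python
def intra_group_comparators(width: int, srcs_per_insn: int, dsts_per_insn: int = 1) -> int:
    """Count source-vs-older-destination comparators in one rename group.

    Assumed model: each instruction compares all of its sources against
    all destinations of every older instruction renamed in the same cycle.
    """
    # Number of (younger, older) instruction pairs in a group of `width`.
    older_pairs = width * (width - 1) // 2
    return older_pairs * srcs_per_insn * dsts_per_insn


if __name__ == "__main__":
    for width in (2, 4, 8):
        two = intra_group_comparators(width, 2)
        three = intra_group_comparators(width, 3)
        print(f"width={width}: 2-source={two}, 3-source={three}")
```

Under this model the comparator count grows quadratically with rename width, and allowing a third source on every instruction adds 50% at any given width. Real designs differ in many details, so treat this only as an order-of-magnitude illustration.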


Jeff Scott

unread,
May 10, 2023, 11:38:56 AM5/10/23
to Tommy Thorn, Tariq Kurd, BGB, isa...@groups.riscv.org

Tommy, I don’t disagree, but remember, these are optional extensions.  They may not make sense in HPC, but they may make sense in the embedded MCU world.  Let’s not neglect the needs of the MCU space.  I am not talking specifically about Zicond here.  This is a general statement about 3 source being a “non-starter”.

 

Jeff

Tommy Thorn

unread,
May 10, 2023, 11:39:58 AM5/10/23
to Jeff Scott, Tariq Kurd, BGB, isa...@groups.riscv.org
Show me the [convincing] data that this extension pays for itself.  Not just a single benchmark either.

Tommy

Jeff Scott

unread,
May 10, 2023, 11:41:04 AM5/10/23
to Tommy Thorn, Tariq Kurd, BGB, isa...@groups.riscv.org

My statements are not about “this extension”.

kr...@sifive.com

unread,
May 10, 2023, 1:11:01 PM5/10/23
to Jeff Scott, Tommy Thorn, Tariq Kurd, BGB, isa...@groups.riscv.org

Three-input integer instructions certainly have a much bigger barrier
to overcome for consideration for non-Zfinx systems.

In terms of evaluation, it would be interesting to see the incremental
benefit across a range of benchmarks in performance and code size once
Zicond (which has much smaller cost) is included in baseline. It
might be that the gain is large enough for those building Zfinx to consider
proposing it as an extension.

To be clear, I'm personally skeptical of there being a significant
benefit, assuming Zicond in baseline, even for Zfinx systems, but we
should always be open to seeing the quantitative data.

Krste
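
For readers following the "assuming Zicond in baseline" comparison, here is a small Python model of how a full select can be built from the two Zicond instructions plus an OR, each using only two source registers. The czero semantics follow the Zicond specification; `select_three_source` is a hypothetical single three-source instruction included purely for comparison, and the helper names and the XLEN=64 choice are assumptions of this sketch.

```python
XLEN = 64                      # assumed register width for this model
MASK = (1 << XLEN) - 1


def czero_eqz(rs1: int, rs2: int) -> int:
    """czero.eqz rd, rs1, rs2: rd = 0 if rs2 == 0, else rs1."""
    return 0 if (rs2 & MASK) == 0 else rs1 & MASK


def czero_nez(rs1: int, rs2: int) -> int:
    """czero.nez rd, rs1, rs2: rd = 0 if rs2 != 0, else rs1."""
    return 0 if (rs2 & MASK) != 0 else rs1 & MASK


def select_with_zicond(cond: int, a: int, b: int) -> int:
    """rd = a if cond != 0 else b, as three two-source operations."""
    t0 = czero_eqz(a, cond)    # t0 = a when cond != 0, otherwise 0
    t1 = czero_nez(b, cond)    # t1 = b when cond == 0, otherwise 0
    return t0 | t1             # or rd, t0, t1


def select_three_source(cond: int, a: int, b: int) -> int:
    """Hypothetical three-source select, for comparison only."""
    return (a if (cond & MASK) != 0 else b) & MASK
```

The incremental benefit being discussed is therefore the difference between a three-instruction sequence and one instruction, in the cases where neither zeroed operand can be folded away.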

Allen Baum

unread,
May 10, 2023, 5:33:54 PM5/10/23
to kr...@sifive.com, Jeff Scott, Tommy Thorn, Tariq Kurd, BGB, isa...@groups.riscv.org
And, of course, if your market segment/application/configuration can benefit from it at reasonable cost,
you should feel free to implement your secret sauce as a custom extension - and that's a good thing (for you).
That's exactly why we have custom extensions.
IF that market segment/application/configuration is big enough, it will inevitably become a standard.



Christopher Celio

unread,
May 10, 2023, 6:17:42 PM5/10/23
to Tommy Thorn, RISC-V ISA Specification Discussion
I have to agree with Tommy here. 3-operand integer is very expensive for superscalar cores. While the register file and execution datapaths can be sized to support "80-percentile" use-case, the decode/dependency-check/rename at the front of the machine has to be sized for worst-case (all integer instructions needing 3-operands). Yes, ARM and friends show it is possible, but it is a fundamental critical path that limits machine width. Considering average register operand needs are <1.5 per instruction, throwing down 3 ports for every instruction and having that be the limiting width factor feels a bit off. It's nice if 3-operands can cut down your program's "critical path length", but not if you added effective path length elsewhere in your design via cracking or sequencing. 

Also, once you have 3-operand int instructions, you probably need to double-down and add a lot more dense/CISC operations to keep the arithmetic intensity up to make up for the width limiting that 3-operand support just caused (which then are likely cracked into multiple uops after making it past the rename pinch-point). This more "rich" approach is a perfectly fine design point to choose, but an open ISA should be mindful of the complexity each extension adds and whether or not a new extension will be truly "optional". 

And this isn't just about "HPC", it's any place a superscalar/out-of-order core may end up. Historically, OoO cores have taken over domains in which in-order (IO) cores used to be considered "good enough", so any extensions/profiles that crystallize around IO cores may eventually become legacy for OoO cores to support.

-Chris
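
The worst-case sizing point above can be illustrated with a little arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, and 1.5 is used as the average only because the message says the real figure is below it.

```python
def worst_case_source_lookups(width: int, max_srcs: int) -> int:
    """Source lookups the front end must handle if every slot uses max_srcs."""
    return width * max_srcs


def average_port_utilization(avg_srcs: float, max_srcs: int) -> float:
    """Fraction of provisioned per-instruction source ports used on average."""
    return avg_srcs / max_srcs


if __name__ == "__main__":
    avg = 1.5  # upper bound quoted in the message; assumed here as the average
    for max_srcs in (2, 3):
        lookups = worst_case_source_lookups(4, max_srcs)
        used = average_port_utilization(avg, max_srcs)
        print(f"max_srcs={max_srcs}: lookups at width 4 = {lookups}, utilization = {used:.0%}")
```

With these assumed numbers, going from two to three sources per instruction raises the provisioned lookups by half while average utilization drops from at most 75% to at most 50%.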

MitchAlsup

unread,
May 10, 2023, 7:16:37 PM5/10/23
to RISC-V ISA Dev, Christopher Celio, RISC-V ISA Specification Discussion, Tommy Thorn
On Wednesday, May 10, 2023 at 5:17:42 PM UTC-5 Christopher Celio wrote:
> I have to agree with Tommy here. 3-operand integer is very expensive for superscalar cores. While the register file and execution datapaths can be sized to support "80-percentile" use-case, the decode/dependency-check/rename at the front of the machine has to be sized for worst-case (all integer instructions needing 3-operands).

This is no different from running into 11 FMAC instructions in a dense sequence. Sooner or later the decoder needs to punt.
But, if the FP side can do it, just instantiate another copy of the FP renamer for the (int) side.

But importantly, over on the FP side there is 1 FMAC (3 register ports) unit and several other FP units of 2-operand and 1-operand varieties.
So, I don't see why the same rules do not apply over on the int side: {one × 3-operand, n-2 × 2-operand, and one 1-operand} per cycle, and
you still have the register porting average of 2 operands per instruction.
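
The port budget in that mix can be checked directly. A minimal sketch, with hypothetical function names, of the partitioned slot mix described above versus giving every slot three ports:

```python
def partitioned_read_ports(n: int) -> int:
    """Ports for one 3-operand slot, n-2 two-operand slots, one 1-operand slot."""
    if n < 2:
        raise ValueError("the mix needs at least two issue slots")
    return 1 * 3 + (n - 2) * 2 + 1 * 1


def uniform_read_ports(n: int, srcs_per_slot: int) -> int:
    """Ports when every one of the n slots gets the same number of sources."""
    return n * srcs_per_slot


if __name__ == "__main__":
    for n in (2, 4, 6, 8):
        print(n, partitioned_read_ports(n), uniform_read_ports(n, 2), uniform_read_ports(n, 3))
```

The partitioned mix always totals 2n ports, the same as a uniform two-source machine. What this arithmetic does not capture is the cost of steering a 3-operand instruction to the one slot that can accept it.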
 
> Yes, ARM and friends show it is possible, but it is a fundamental critical path that limits machine width. Considering average register operand needs are <1.5 per instruction, throwing down 3 ports for every instruction and having that be the limiting width factor feels a bit off.

You DON'T lay down 3-operands for any/every instruction, <above> you partition the porting and renaming.
 
> It's nice if 3-operands can cut down your program's "critical path length", but not if you added effective path length elsewhere in your design via cracking or sequencing.
>
> Also, once you have 3-operand int instructions, you probably need to double-down and add a lot more dense/CISC operations to keep the arithmetic intensity up to make up for the width limiting that 3-operand support just caused (which then are likely cracked into multiple uops after making it past the rename pinch-point).

Realistically, you would want { (int)MAC, INSert, and CMOV }, as the 3-operand heavy hitters {and likely punt the rest}

Tommy Thorn

unread,
May 10, 2023, 7:19:38 PM5/10/23
to MitchAlsup, RISC-V ISA Dev, Christopher Celio

> On May 10, 2023, at 16:16, 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
>
> Yes, ARM and friends show it is possible, but it is a fundamental critical path that limits machine width. Considering average register operand needs are <1.5 per instruction, throwing down 3 ports for every instruction and having that be the limiting width factor feels a bit off.
>
> You DON'T lay down 3-operands for any/every instruction, <above> you partition the porting and renaming.

You don’t get to choose where the 3-op insns end up, so either you prepare for the worst case or you add yet more cycle-time wasting muxing to shift things around. There’s no free lunch.

Tommy

MitchAlsup

unread,
May 10, 2023, 7:27:52 PM5/10/23
to RISC-V ISA Dev, Tommy Thorn, RISC-V ISA Dev, Christopher Celio, MitchAlsup
You don't get to choose where the FADD and FMAC and FCMP end up either. And whereas most (int) instructions can use a small function unit (ALU) that is replicated as much as you like/can-afford, FMAC and FADD do not fall into the "replicate as much as you like" category. So, the FP side is the harder one.

Tommy Thorn

unread,
May 10, 2023, 7:38:04 PM5/10/23
to MitchAlsup, RISC-V ISA Dev, Christopher Celio
You are correct that we are already partitioning the FP and Int in the renamer, but there’s no interaction between FP registers and integer registers, so this isn’t a big deal.  I don’t see how you can pull 3-op instructions out of an interdependent integer sequence and treat them as a special case.  I guess you could stall the renamer for a cycle to handle the exceptional 3-op instruction, but that also adds complication and overhead, in addition to a penalty cycle.

You’d have to come up with some spectacularly convincing benchmarks to justify 3-op. Of the three examples (MAC, INS, and CMOV), only the latter might be interesting, but it’s actually pretty hard for the compiler to make effective use of it.

There is still no free lunch.

Tommy






Paul Campbell

unread,
May 10, 2023, 7:56:15 PM5/10/23