Integer Carry propagation, Overflow, Underflow and Divide by Zero


Michael Clark

Aug 27, 2016, 8:44:35 PM
to RISC-V ISA Dev
Hi All,

Just a quick thought. Likely already in a few people's minds.

I am aware of the design decision in RISC-V to avoid a flag state as this creates a pipeline bottleneck.

But has anyone considered extending the architectural registers with hidden bits to turn this bottleneck into a regular register dependency? I.e. 69-bit registers:

- NA = Not an Address (set by an operation on two addresses), as a distinct field
- DZ = Divide by Zero
- OF = Overflow from signed or unsigned multiply, conversion to/from signed/unsigned
- UF = Underflow from signed or unsigned multiply, conversion to/from signed/unsigned
- CA = Carry from Add or Borrow from Subtract (separate Borrow flag)

This is non-standard. I’m just reasoning about why flags were removed (pipeline hazards) and where they could be; thinking of something symmetrical with hidden FPU internal state.

Of course it would require some sort of extension, but add with carry, subtract with carry, or subtract with borrow could be implemented using only existing register dependencies in an out of order implementation. Given the architecture is load/store with all explicit operands (no messy memory operands that would break this), it could potentially be quite neat and tidy (it doesn’t even require tagged memory). The address truth table is pretty easy to work out: the only legal operation is to add or subtract a scalar to or from an address. An extension would not need to expose the flags; rather, branch-on-flags instructions would specify a register, quite different to other architectures that branch on a bottlenecked central flag state.

This would be an extension for Architectural Integer Register Flags and Conditional Branches. An Add or an Add with Carry in this architecture both depend on the register state of their operands, and not on any centralised flag state. I don’t know if it is worth it. Just thought I’d share my thoughts.

I haven’t reasoned about whether precise exceptions in the ALU complicate things; however, I am thinking of applications that may wish to receive exceptions on underflow or overflow, or propagate carries in an efficient way. The last time I emailed about this, it was about reading integer flag state similarly to fflags. However, if the flags are part of the register, they can be used in branches naming a single register: one register providing 5 flags plus a short branch immediate would fit into the coding space of a regular branch, assuming there are enough remaining opcodes. This is different to “other” architectures that have centralised flags.

The reason I thought about this was due to implementing some carry detection in C for big integer math without using any hardware state like flags. A lot of these underflow and overflow conditions are undefined behaviour in C; and one wonders whether one day they will become defined.
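
For illustration, one overflow check can already be written without relying on that undefined behaviour, by doing the arithmetic in unsigned types where wraparound is defined. A minimal sketch (the function name is illustrative, not from this thread):

```c
#include <assert.h>
#include <stdint.h>

/* Detect signed 32-bit addition overflow without undefined behaviour:
 * perform the add in unsigned arithmetic (wraparound is defined) and
 * compare sign bits. Overflow occurred iff a and b have the same sign
 * and the sum's sign differs from it. */
static int add_overflows(int32_t a, int32_t b) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;                      /* defined wraparound */
    return (int)((~(ua ^ ub) & (ua ^ sum)) >> 31);
}
```

On RISC-V this lowers to a few XOR/AND/SRLI instructions alongside the add itself, with no flag state involved.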

The other thought was extensions for Saturated Arithmetic (graceful underflow/overflow). Roadmap item for someone who needs it. Throttles that don’t magically reverse direction...

Michael.

Stefan O'Rear

Aug 27, 2016, 9:34:18 PM
to Michael Clark, RISC-V ISA Dev
On Sat, Aug 27, 2016 at 5:44 PM, Michael Clark <michae...@mac.com> wrote:
> Hi All,
>
> Just a quick thought. Likely already in a few peoples minds.
>
> I am aware of the design decision in RISC-V to avoid a flag state as this creates a pipeline bottleneck.
>
> But has anyone considered extending the architectural registers with hidden bits to turn this bottleneck into a regular register dependency. i.e. 69-bit registers:

Using widened physical registers, as an implementation technique, is
presumably well-known to the Foundation as it is explicitly called out
on page 8 of Andrew Waterman's thesis in relation to the
implementation of conditional moves on the Alpha 21264.

> - NA = Not an Address (set by an operation on two addresses), as a distinct field
> - DZ = Divide by Zero
> - OF = Overflow from signed or unsigned multiply, conversion to/from signed/unsigned
> - UF = Underflow from signed or unsigned multiply, conversion to/from signed/unsigned
> - CA = Carry from Add or Borrow from Subtract (separate Borrow flag)
>
> This is non-standard. I’m just reasoning about why flags were removed (pipeline hazards) and where they could be; thinking of something symmetrical with hidden FPU internal state.
>
> Of course it would require some sort of extension, but add with carry, subtract with carry or subtract with borrow could be implemented using only existing register dependencies in an out of order implementation, and given the architecture is load store with all explicit operands (no messy memory operands that would break this), this could potentially be quite neat and tidy (doesn’t even require tagged memory). The address truth table is pretty easy to work out. The only legal operation is to add or subtract a scalar to an address. An extension would not need to expose flags rather using Branch on flags specifying a register. Quite different to some other architectures that branch on a bottleneck flag state.

You have just given the reason why this is not in the specification:
"in an out of order implementation". Flag bits are extremely cheap in
microcode and single-issue implementations, wider registers are
(relatively) cheap on OoO implementations, RISC-V provides neither in
the base spec in a nod to microarchitecture independence.

> This would be an extension for Architectural Integer Register Flags and Conditional Branches. An Add or an Add with Carry in this architecture both depend on the register state of their operands, and not on any centralised flag state. I don’t know if it is worth it. Just thought I’d share my thoughts.
>
> I haven’t reasoned about whether precise exceptions in the ALU complicate things

If ALU operations can throw exceptions, then the renamer needs to
snapshot the architectural register state on each ALU operation,
rather than just jumps and memory access. Anything that increases the
work for the renamer is bad news.

> however I am thinking of applications that may wish to receive exceptions on Underflow, Overflow or propagate carries in an efficient way.

Nobody (meaning language implementors) uses trapping arithmetic even
on ISAs where it's available. The reason is that signal handling is a
PITA, both in terms of requiring OS-specific code, and because most
prebuilt compiler backends don't support "asynchronous" (between any
pair of instructions) language-level exceptions. It's much easier for
everyone involved to generate a conditional jump to a
language-provided panic routine.

So what I'm most interested in seeing in *new CPUs* is a suite of new
branch instructions that can be used for as-low-cost-as-possible
detection of integer overflow conditions. But see the next point.

> The last time I emailed about this, it was about reading integer flag state similarly to fflags however if they are part of the register they can be used in branches with one register, a register used as 5 flags and a short branch immediate e.g. would fit into the coding space of a regular branch assuming there are enough remaining opcodes.
> This is different to “other” architectures that have centralised flags.

I think we are working from very different ideas of the purpose of the
RISC-V core specification. IMO it's not supposed to be "different
from other architectures". It's supposed to be as familiar as
possible to compiler writers and CPU designers, minimizing novelty and
surprises and maximizing precedent so that it can become the Standard
Boring ISA.

As part of being the Standard Boring ISA, it has facilities to be
extended with site-specific Interestingness. But that is not the same
as being Interesting itself. Interesting features like 69-bit
architectural register files and physical memory partitioning do
not, IMO, have a place in the baseline specifications in 2016. If they
become wildly popular among vendors, then they could be considered for
a standard extension later, but the standard has to lag
implementations, not lead them. (If you look on github you'll see
that they switched Rocket to priv-1.9 well before publishing the
actual document.)

Standardizing things before they have been implemented is how you get
trainwrecks. Do not go there.

>
> The reason I thought about this was due to implementing some carry detection in C for big integer math without using any hardware state like flags. A lot of these underflow and overflow conditions are undefined behaviour in C; and one wonders whether one day they will become defined.

Overflow and several related things are undefined in C because of a
perception that leaving them undefined helps loop optimization. The
optimizer doesn't need to consider what happens when an induction
variable wraps around, and a loop like for (int x = 0; x != j; x++)
some_function(); can be inferred to terminate because either j is
positive and will be reached, or j is negative and the loop is UB.

** This has nothing to do with ISA-level behavior. **

Anyway, carry detection for addition in C is a bad example, because it
can already be done with the SLT instruction. This is valid C and
results in two RVI instructions after inlining:

unsigned add_with_carry(unsigned a, unsigned b, unsigned *carry_out) {
    unsigned ab = a + b;
    *carry_out = ab < a;
    return ab;
}

> The other thought was extensions for Saturated Arithmetic (graceful underflow/overflow). Roadmap item for someone who needs it. Throttles that don’t magically reverse direction...

The spec should lag silicon here. Know anyone building RISC-V based DSPs?

-s

Jacob Bachmeyer

Aug 27, 2016, 10:42:24 PM
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
> - NA = Not an Address (set by an operation on two addresses), as a distinct field
>
> [...]
>
> [...] (doesn’t even require tagged memory). The address truth table is pretty easy to work out. The only legal operation is to add or subtract a scalar to an address.
>
How do you know if a value loaded from memory is an address or a scalar
without tagged memory?

-- Jacob

Michael Clark

Aug 27, 2016, 10:52:44 PM
to jcb6...@gmail.com, RISC-V ISA Dev

On 28 Aug 2016, at 2:42 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

How do you know if a value loaded from memory is an address or a scalar without tagged memory?

It would potentially be possible but it needs close coordination between the compiler, the binary format and binary metadata (or ISA extension).

C unions might be a problem (a corner case), but generally memory holds either a pointer or a scalar, and this is /usually/ or /most often/ known at compile time for languages like C.

It depends very much on the language. This is something I wanted to spend some time on, with further study of PIC code from the various compilers.

AUIPC+ADDI is always loading a relative address

LUI+ADDI is problematic as it could be loading an absolute address or a constant

Likewise a register relative load could be loading an address or scalar.

The types are known at compile time except for the aliasing cases.

The types may even be able to be resolved post compile from DWARF metadata.

Essentially if these loads and casts are labelled, there may be no need for memory tags. i.e. if the object code says it is loading an address, there is no need for tagged memory.

It would certainly be nice if the combination of ISA and binary encapsulation e.g. ELF+DWARF let one introspect on this at load/verification time...

Michael.

Michael Clark

Aug 27, 2016, 11:10:52 PM
to Stefan O'Rear, RISC-V ISA Dev

> On 28 Aug 2016, at 1:34 PM, Stefan O'Rear <sor...@gmail.com> wrote:
>
> On Sat, Aug 27, 2016 at 5:44 PM, Michael Clark <michae...@mac.com> wrote:
>> Hi All,
>>
>> Just a quick thought. Likely already in a few peoples minds.
>>
>> I am aware of the design decision in RISC-V to avoid a flag state as this creates a pipeline bottleneck.
>>
>> But has anyone considered extending the architectural registers with hidden bits to turn this bottleneck into a regular register dependency. i.e. 69-bit registers:
>
> Using widened physical registers, as an implementation technique, is
> presumably well-known to the Foundation as it is explicitly called out
> on page 8 of Andrew Waterman's thesis in relation to the
> implementation of conditional moves on the Alpha 21264.

I must read Andrew’s thesis in more detail. It occurred to me, from reasoning within the confines of the RISC-V load/store model, that this is all much clearer and easier to handle.

>> - NA = Not an Address (set by an operation on two addresses), as a distinct field
>> - DZ = Divide by Zero
>> - OF = Overflow from signed or unsigned multiply, conversion to/from signed/unsigned
>> - UF = Underflow from signed or unsigned multiply, conversion to/from signed/unsigned
>> - CA = Carry from Add or Borrow from Subtract (separate Borrow flag)
>>
>> This is non-standard. I’m just reasoning about why flags were removed (pipeline hazards) and where they could be; thinking of something symmetrical with hidden FPU internal state.
>>
>> Of course it would require some sort of extension, but add with carry, subtract with carry or subtract with borrow could be implemented using only existing register dependencies in an out of order implementation, and given the architecture is load store with all explicit operands (no messy memory operands that would break this), this could potentially be quite neat and tidy (doesn’t even require tagged memory). The address truth table is pretty easy to work out. The only legal operation is to add or subtract a scalar to an address. An extension would not need to expose flags rather using Branch on flags specifying a register. Quite different to some other architectures that branch on a bottleneck flag state.
>
> You have just given the reason why this is not in the specification:
> "in an out of order implementation". Flag bits are extremely cheap in
> microcode and single-issue implementations, wider registers are
> (relatively) cheap on OoO implementations, RISC-V provides neither in
> the base spec in a nod to microarchitecture independence.

Understand.

>> This would be an extension for Architectural Integer Register Flags and Conditional Branches. An Add or an Add with Carry in this architecture both depend on the register state of their operands, and not on any centralised flag state. I don’t know if it is worth it. Just thought I’d share my thoughts.
>>
>> I haven’t reasoned about whether precise exceptions in the ALU complicate things
>
> If ALU operations can throw exceptions, then the renamer needs to
> snapshot the architectural register state on each ALU operation,
> rather than just jumps and memory access. Anything that increases the
> work for the renamer is bad news.

So no exceptions. Just branches.

>> however I am thinking of applications that may wish to receive exceptions on Underflow, Overflow or propagate carries in an efficient way.
>
> Nobody (meaning language implementors) uses trapping arithmetic even
> on ISAs where it's available. The reason is that signal handling is a
> PITA, both in terms of requiring OS-specific code, and because most
> prebuilt compiler backends don't support "asynchronous" (between any
> pair of instructions) language-level exceptions. It's much easier for
> everyone involved to generate a conditional jump to a
> language-provided panic routine.
>
> So what I'm most interested in seeing in *new CPUs* is a suite of new
> branch instructions that can be used for as-low-cost-as-possible
> detection of integer overflow conditions. But see the next point.

Yes. The reason I am interested is that at some point someone will need to implement some extensions and implement compiler support.

I recall the original reason that spurred the thought was UBSAN checks which use conditional branches on flags on other architectures.

>> The last time I emailed about this, it was about reading integer flag state similarly to fflags however if they are part of the register they can be used in branches with one register, a register used as 5 flags and a short branch immediate e.g. would fit into the coding space of a regular branch assuming there are enough remaining opcodes.
>> This is different to “other” architectures that have centralised flags.
>
> I think we are working from very different ideas of the purpose of the
> RISC-V core specification. IMO it's not supposed to be "different
> from other architectures". It's supposed to be as familiar as
> possible to compiler writers and CPU designers, minimizing novelty and
> surprises and maximizing precedent so that it can become the Standard
> Boring ISA.
>
> As part of being the Standard Boring ISA, it has facilities to be
> extended with site-specific Interestingness. But that is not the same
> as being Interesting itself. Interesting features like 69-bit
> architectural register files and physical memory partitioning do
> not, IMO, have a place in the baseline specifications 2016. If they
> become wildly popular among vendors, then they could be considered for
> a standard extension later, but the standard has to lag
> implementations, not lead them. (If you look on github you'll see
> that they switched Rocket to priv-1.9 well before publishing the
> actual document.)
>
> Standardizing things before they have been implemented is how you get
> trainwrecks. Do not go there.

No worries. Someone might prototype something in 2017 or 2018...

>>
>> The reason I thought about this was due to implementing some carry detection in C for big integer math without using any hardware state like flags. A lot of these underflow and overflow conditions are undefined behaviour in C; and one wonders whether one day they will become defined.
>
> Overflow and several related things are undefined in C because of a
> perception that leaving them undefined helps loop optimization. The
> optimizer doesn't need to consider what happens when an induction
> variable wraps around, and a loop like for (int x = 0; x != j; x++)
> some_function(); can be inferred to terminate because either j is
> positive and will be reached, or j is negative and the loop is UB.
>
> ** This has nothing to do with ISA-level behavior. **

Although the ISA does come into play with respect to the efficiency of runtime UBSAN checks.

> Anyway, carry detection for addition in C is a bad example, because it
> can already be done with the SLT instruction. This is valid C and
> results in two RVI instructions after inlining:
>
> unsigned add_with_carry(unsigned a, unsigned b, unsigned *carry_out) {
>     unsigned ab = a + b;
>     *carry_out = ab < a;
>     return ab;
> }

Yes. It’s unnatural and produces large code, unfortunately, and needs more reasoning about signed versus unsigned. I am currently looking at MULH[[S]U] in C, with overflow on the largest type for 128-bit additions.
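
For what it's worth, the high half of a 64×64 unsigned multiply can be expressed portably in C from four 32-bit partial products. A minimal sketch (function name illustrative), which a compiler targeting RV64 may be able to lower to a single MULHU:

```c
#include <assert.h>
#include <stdint.h>

/* Portable high 64 bits of a 64x64 unsigned multiply, built from four
 * 32-bit partial products; the middle-column sum cannot overflow a
 * uint64_t, so no carry flag is needed. */
static uint64_t umulh64(uint64_t a, uint64_t b) {
    uint64_t alo = a & 0xffffffffu, ahi = a >> 32;
    uint64_t blo = b & 0xffffffffu, bhi = b >> 32;
    uint64_t ll = alo * blo;
    uint64_t lh = alo * bhi;
    uint64_t hl = ahi * blo;
    uint64_t hh = ahi * bhi;
    /* Sum of the middle column plus the carry out of the low word. */
    uint64_t mid = (ll >> 32) + (lh & 0xffffffffu) + hl;
    return hh + (lh >> 32) + (mid >> 32);
}
```

On compilers supporting `unsigned __int128`, `(uint64_t)(((unsigned __int128)a * b) >> 64)` expresses the same thing more directly.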

>> The other thought was extensions for Saturated Arithmetic (graceful underflow/overflow). Roadmap item for someone who needs it. Throttles that don’t magically reverse direction...
>
> The spec should lag silicon here. Know anyone building RISC-V based DSPs?

Agree. No.

But I did hear about PoCL and HWACHA, which are related to RISC-V; however, I understand they are non-standard extensions.

I can imagine the pixel pipe in Mesa could benefit from saturated arithmetic. It is present to some degree in some architectures’ SIMD extensions.

Perhaps Saturated Arithmetic may appear in the Packed SIMD extension placeholder section at some time in the future.

Thanks,
Michael.

> -s
>

Samuel Falvo II

Aug 27, 2016, 11:29:22 PM
to Michael Clark, Stefan O'Rear, RISC-V ISA Dev
On Sat, Aug 27, 2016 at 8:10 PM, Michael Clark <michae...@mac.com> wrote:
>> Anyway, carry detection for addition in C is a bad example, because it
>> can already be done with the SLT instruction. This is valid C and
>> results in two RVI instructions after inlining:
>>
>> unsigned add_with_carry(unsigned a, unsigned b, unsigned *carry_out) {
>>     unsigned ab = a + b;
>>     *carry_out = ab < a;
>>     return ab;
>> }
>
> Yes. I’t unnatural and large code unfortunately and needs more reasoning about signed unsigned. I am currently looking at MULH[[S]U] in C, with overflow on the largest type for 128-bit additions.

I don't have any experience in this, but I've read from several
trusted sources that being able to add-with-carry and
subtract-with-carry proves important in optimizing cryptographic
algorithms. With security on everyone's mind these days, support for
multi-precision arithmetic becomes more important. Of course, one can
use op-fusion to recognize this pattern:

add x2, x3, x4
slt x5, x2, x3
add x5, x5, x6
add x5, x5, x7

and interpret it as:

add x2, x3, x4, carryOut ; <== carryOut is internal to CPU
adc x5, x6, x7, carryOut

It'd be memory-expensive, but runtime-efficient. Might be a worthy
tradeoff for applications that need this capability. It depends on how
frequently one actually needs ADC. For a 32-bit CPU, I can readily see
it being more frequently needed than on a 64-bit CPU, and virtually
unnecessary on a 128-bit CPU.
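
For context, the C source that naturally produces the add/slt/add/add shape above is an ordinary multi-word addition. A minimal sketch of a two-limb (128-bit) add from 64-bit limbs (function name illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Two-limb (128-bit) addition from 64-bit limbs. The carry computation
 * (lo < a[0]) compiles to SLTU, giving the add/sltu/add/add sequence
 * that the proposed fusion would recognize as ADD + ADC. */
static void add128(uint64_t a[2], const uint64_t b[2]) {
    uint64_t lo = a[0] + b[0];
    uint64_t carry = lo < a[0];   /* SLTU: 1 if the low add wrapped */
    a[0] = lo;
    a[1] = a[1] + b[1] + carry;
}
```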

--
Samuel A. Falvo II

Jacob Bachmeyer

Aug 27, 2016, 11:49:04 PM
to Michael Clark, RISC-V ISA Dev
Michael Clark wrote:
>> On 28 Aug 2016, at 2:42 PM, Jacob Bachmeyer <jcb6...@gmail.com
>> <mailto:jcb6...@gmail.com>> wrote:
>>
>> How do you know if a value loaded from memory is an address or a
>> scalar without tagged memory?
> Essentially if these loads and casts are labelled, there may be no
> need for memory tags. i.e. if the object code says it is loading an
> address, there is no need for tagged memory.
So separate LOAD-SCALAR and LOAD-POINTER opcodes, then?
> It would certainly be nice if the combination of ISA and binary
> encapsulation e.g. ELF+DWARF let one introspect on this at
> load/verification time...
But checking at load time cannot make use of extra per-register flag
bits that are tracked only at run time. Or am I missing something?

Michael Clark

Aug 28, 2016, 12:03:34 AM
to Samuel Falvo II, Stefan O'Rear, RISC-V ISA Dev
Yes, this is interesting. One issue will be lowering the carry conditional to slt in a language like C that doesn’t have explicit carry support. I will check the RISC-V asm emitted when I get up to this point...

Jacob Bachmeyer

Aug 28, 2016, 12:09:58 AM
to Michael Clark, Samuel Falvo II, Stefan O'Rear, RISC-V ISA Dev
There is no need for a carry conditional when the compiler can
essentially do op-fusion in reverse: by looking up the source operands
of the previous "add", it can generate the "slt; add; add" sequence
when it wants to emit "add-with-carry".

Michael Clark

Aug 28, 2016, 12:42:28 AM
to jcb6...@gmail.com, RISC-V ISA Dev
On 28 Aug 2016, at 3:49 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

Michael Clark wrote:
On 28 Aug 2016, at 2:42 PM, Jacob Bachmeyer <jcb6...@gmail.com <mailto:jcb6...@gmail.com>> wrote:

How do you know if a value loaded from memory is an address or a scalar without tagged memory?
Essentially if these loads and casts are labelled, there may be no need for memory tags. i.e. if the object code says it is loading an address, there is no need for tagged memory.
So separate LOAD-SCALAR and LOAD-POINTER opcodes, then?

That was what I was thinking of, along with making AUIPC+ADDI always a pointer, LUI+ADDI always integral, and all code PIC. It needs compiler coordination. I’m not up to this point with the implementation, so it’s more a heads-up about this possibility.

It would certainly be nice if the combination of ISA and binary encapsulation e.g. ELF+DWARF let one introspect on this at load/verification time...
But checking at load time cannot make use of extra per-register flag bits that are tracked only at run time.  Or am I missing something?

If the pointers and integrals are clear in the machine code, then a certain amount of static verification can be performed. There is potentially a form, i.e. a convention or subset of RISC-V output, that is amenable to static translation, e.g. placing a special nop before a load of a pointer. We know that sp points to the stack. We know that gp points to the TEXT. Knowing statically whether a register contains a pointer or an integral is quite useful.

You may be thinking from a tagged memory perspective (dynamic vs static). When a typed language is loading a value into an ‘int’ or a ‘void*’, the compiler knows the value is pointer or integral, so we can emit machine code that distinguishes this. Likewise when there is a pointer cast. This can be used for static verification or potentially machine translation.

The difficult cases in C are aliasing e.g.

union {
  long d;
  void *g;
} a;

void foo(long d, void *g);

long b;

foo (b, (void*)&b);

However one could imagine a cast would emit a mv that could set or clear the hidden NA flag.

It needs some more thinking, but I believe one could statically analyse a binary prepared in a certain way, without tagged memory, and distinguish whether a register held a pointer or an integral.

It would need to be a well-formed verifiable subset of RISC-V and there may need to be some constraints on ELF segments.

This is the inspiration:

Stefan O'Rear

Aug 28, 2016, 2:18:50 AM
to Samuel Falvo II, Michael Clark, RISC-V ISA Dev
On Sat, Aug 27, 2016 at 8:29 PM, Samuel Falvo II <sam....@gmail.com> wrote:
> I don't have any experience in this, but I've read from several
> trusted sources that being able to add-with-carry and
> subtract-with-carry proves important in optimizing cryptographic
> algorithms. With security on everyone's mind these days, support for
> performing multi-precision becomes more important. Of course, one can
> use op-fusion to recognize this pattern:

The real fun is with multiplication. Rocket has a variable-time
integer multiplier, so anyone writing generic crypto kernels for
RISC-V will probably avoid the MUL instruction entirely. Probably the
multiplications will be done by converting to floating point - this
was common on x86 in the Pentium era, since the Pentium float
multiplier had much higher throughput than its integer multiplier
IIUC.

OTOH, BOOM has a 1/cycle 64x64 multiplier that's impressively large on
the floorplans and would be nice to use, and other implementations
will have different constraints. This is why I'm interested in a
standard *user-level* mechanism for querying the microarchitecture, we
have mimpid in machine mode but nothing analogous for the AEE or SEE.

-s

Michael Clark

Aug 28, 2016, 2:39:32 AM
to Stefan O'Rear, Samuel Falvo II, RISC-V ISA Dev
Yes. Constant-time arithmetic will be important for crypto. It will be interesting when someone ports OpenSSL.

Fixed-width Karatsuba would be constant-time no matter the size of the multiplicands. An implementation could potentially use, say, a 1024-entry table of 5-bit × 5-bit products; however, there may be cache timing effects. I don’t know how many cycles a load from L1 would take. It would be interesting to compare approaches with the hardware multiplier. The benefit is obviously going to be in constant time, even if it is slower than the hardware multiplier.

The FPU gives us 52 bits, but not all implementations will have an FPU, so a table-based Karatsuba multiplier might be useful. This would be interesting to experiment with.

I noticed the GCM code in OpenSSL has various table size options to limit cache timing effects.

I also note that Montgomery multiplication for ECC can be done with shifts and adds, which would be constant-time on Rocket.
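
A shift-and-add multiply of the kind mentioned above can be sketched in C. This is a minimal illustration (function name illustrative), constant-time in the sense of executing a fixed 64 iterations with no data-dependent branches or table lookups:

```c
#include <assert.h>
#include <stdint.h>

/* Constant-time 64x64 -> low-64 multiply by shift-and-add: always 64
 * iterations, and each conditional add is replaced by an AND with an
 * all-ones/all-zeros mask derived from the multiplier bit, so the
 * sequence of executed instructions is independent of the operands. */
static uint64_t ct_mul64(uint64_t a, uint64_t b) {
    uint64_t acc = 0;
    for (int i = 0; i < 64; i++) {
        uint64_t mask = (uint64_t)0 - ((b >> i) & 1); /* 0 or ~0 */
        acc += (a << i) & mask;
    }
    return acc;
}
```

This avoids the MUL instruction entirely, at a cost of roughly 64 shift/and/add steps per product.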

Stefan O'Rear

Aug 28, 2016, 3:02:08 AM
to Michael Clark, Samuel Falvo II, RISC-V ISA Dev
On Sat, Aug 27, 2016 at 11:39 PM, Michael Clark <michae...@mac.com> wrote:
> Fixed width Karatsuba would be constant time no matter the size of the
> multiplicands. An implementation could potentially use say a 1024 entry
> table of 5-bit x 5-bit products however there may be cache timing effects. I
> don’t know how many cycles for a load from L1 would be. It would be
> interesting to compare approaches with the hardware multiplier. The benefit
> is obviously going to be in constant time, even if it is slower than the
> hardware multiplier.

Been there, done that, didn't work.

https://eprint.iacr.org/2016/224.pdf

L1 cache *hits* are not constant time on common 2016
microarchitectures, and leak address bits to code running SMT on the
same core.

> I noticed the GCM code in openssl has various table size options to limit
> cache timing effects.

AFAIK, the "constant time" GCM code is considered a joke by cryptographers.

-s

Michael Clark

Aug 28, 2016, 3:25:57 AM
to Stefan O'Rear, Samuel Falvo II, RISC-V ISA Dev
Interesting.

I am curious about how one would implement constant time multiply on Rocket in software in a way that is resistant to cache timing effects.

~mc

Allen Baum

Aug 28, 2016, 3:39:57 AM
to RISC-V ISA Dev, sor...@gmail.com
FYI: check out the Mill architecture on www.millcomputing.com.
They define pretty much what you outline here, but the architecture has all sorts of hooks for saving and restoring the metadata bits when necessary. And this is a totally in-order architecture, so there is no renaming to get in the way.
This is a very, very different architecture from any I've ever seen/heard/read about, intended to get close to out-of-order performance without out-of-order HW structures (so presumably faster or lower power for the same performance), with a heavy security focus.

> On Sat, Aug 27, 2016 at 5:44 PM, Michael Clark <michae...@mac.com> wrote:
>> Hi All,
>>
>> Just a quick thought. Likely already in a few peoples minds.
>>
>> I am aware of the design decision in RISC-V to avoid a flag state as this creates a pipeline bottleneck.
>>
>> But has anyone considered extending the architectural registers with hidden bits to turn this bottleneck into a regular register dependency. i.e. 69-bit registers:
>
> Using widened physical registers, as an implementation technique, is
> presumably well-known to the Foundation as it is explicitly called out
> on page 8 of Andrew Waterman's thesis in relation to the
> implementation of conditional moves on the Alpha 21264.

I must read Andrew’s thesis in more detail. It occurred to me, from reasoning within the confines of the RISC-V load/store model, that this is all much clearer and easier to handle.

>> - NA = Not an Address (set by an operation on two addresses), as a distinct field
>> - DZ = Divide by Zero
>> - OF = Overflow from signed or unsigned multiply, conversion to/from signed/unsigned
>> - UF = Underflow from signed or unsigned multiply, conversion to/from signed/unsigned
>> - CA = Carry from Add or Borrow from Subtract (separate Borrow flag)
.....

Paul A. Clayton

Aug 28, 2016, 9:20:49 PM
to RISC-V ISA Dev
On 8/27/16, Stefan O'Rear <sor...@gmail.com> wrote:
> On Sat, Aug 27, 2016 at 5:44 PM, Michael Clark <michae...@mac.com>
> wrote:
[snip]
> If ALU operations can throw exceptions, then the renamer needs to
> snapshot the architectural register state on each ALU operation,
> rather than just jumps and memory access. Anything that increases the
> work for the renamer is bad news.

There is no need for a snapshot of architectural state at each potential
exception point. Hardware can use a previous snapshot and run forward,
knowing that the exception will occur at a given instruction address.
Some microarchitectures use a commit register alias table, where
minimal extra work would be required. Backtracking from the primary
register alias table is also possible.

The argument based on lack of language support makes much more
sense, though one could argue that such is a weakness of existing
software.

Mike Hamburg

Sep 24, 2016, 10:22:50 PM
to Michael Clark, Stefan O'Rear, Samuel Falvo II, RISC-V ISA Dev
If users can’t rely on constant-time MUL, another option is to figure out a class of inputs for which MUL is constant-time and then restrict to them.  For example, if Rocket’s MUL always takes the full cycle count when the high bit is 1, we can use 31-bit arithmetic, add 0x80000000 to the inputs and then cancel it out in the UMULH implementation.  It’s a pain, though, and we’d have to figure out a consistent class of “safe” MUL inputs across different Rocket configs, which might not be easy.  Or we can pretend we’re German and just add lots of randomized blinding to everything.
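The bias-and-cancel trick can be checked numerically. Here is a sketch of my own (the function name and word split are illustrative, not from the thread), assuming hypothetically that forcing the top bit of both 32-bit inputs to 1 makes a data-dependent multiplier take its full cycle count:

```python
def mulhu_biased(a, b):
    """Model the bias trick: restrict to 31-bit operands, add 0x80000000
    to each MUL input so its high bit is always set (the hypothesized
    constant-time input class), then cancel the bias arithmetically."""
    assert 0 <= a < 1 << 31 and 0 <= b < 1 << 31  # 31-bit operands only
    BIAS = 0x80000000
    ab, bb = a + BIAS, b + BIAS           # both inputs now have bit 31 set
    full = ab * bb                        # models the MUL/MULHU pair (64-bit product)
    # (a+B)*(b+B) = a*b + B*(a+b) + B*B, so subtract the bias terms:
    prod = full - BIAS * (a + b) - BIAS * BIAS
    return prod >> 32, prod & 0xFFFFFFFF  # (high word, low word) of a*b
```

The cancellation is exact, so nothing is lost by the restriction except one bit of radix per limb.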

I’d much rather designers made CPUs that will run portable crypto implementations securely.  This requires a non-interference property: that register values only propagate to architecturally visible state through certain channels.  In particular, “data” arguments of most instructions (all arithmetic instructions except div/mod; slt if there’s no adc; the data loaded and stored by load/store, but not the address; at least the shiftee of shift instructions; etc.) should not flow to anything but the output of the instruction.  This nixes ideas like a branch predictor that overzealously looks at data registers, or that suggestion from a while ago that SLT could be branch-predicted.

Obviously CPU designers are free not to take these precautions, unless the arch spec mandates them (and it currently does not).  But then any product using that CPU will have a latent security risk.

Coming back to add with carry, the RISC-V approach works OK when you want to add 2 words with carry.  The first adc is basically 3 instructions instead of 1, which is not great but might be an acceptable consequence of a RISC architecture.  However, if you want to add 3 or more words it isn’t so nice, because x+y+c can wrap around to x again.  So you need to replace a single adc with 5 instructions (add; slt; add the carry; slt; combine the carry flags).  This means that reduced-radix code will always run better on RISC-V than full-radix code, since full-radix has more carry propagation.  Reduced-radix is fine for some crypto primitives but suboptimal for others.
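The five-instruction expansion can be modeled directly to see why both carry checks are needed. A sketch of my own (RISC-V would use SLTU, the unsigned compare, for the carry tests), assuming 64-bit words:

```python
MASK = (1 << 64) - 1  # model 64-bit registers

def adc(x, y, c):
    """Full-radix add-with-carry as the five-op sequence:
    add; sltu; add the carry; sltu; or (combine the carry flags)."""
    s = (x + y) & MASK          # add  s, x, y
    c1 = 1 if s < x else 0      # sltu c1, s, x  -- carry out of x+y
    t = (s + c) & MASK          # add  t, s, c   -- add the incoming carry
    c2 = 1 if t < s else 0      # sltu c2, t, s  -- s+c wraps when s is all-ones
    return t, c1 | c2           # or   cout, c1, c2

def add_bignum(xs, ys):
    """Ripple the carry across multi-word operands (least significant first)."""
    words, c = [], 0
    for x, y in zip(xs, ys):
        w, c = adc(x, y, c)
        words.append(w)
    return words, c
```

At most one of the two flags can be set for any input (x + y + c never exceeds 2^65 - 1), so OR combines them correctly.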

Probably if anyone wants to do crypto with RISC-V and cares about its performance, they will need an ISA extension.  Maybe we should just add UMAAL and be done with it.  Or maybe just a 3-input, 2-output ADD variant would be enough to get started, in combination with a MUL/MULH that’s guaranteed to take constant time.

I’d also be interested to spitball some crypto instructions for the Vector extension, once that’s available.

Cheers,
— Mike

Stefan O'Rear

Sep 24, 2016, 10:43:08 PM
to Mike Hamburg, Michael Clark, Samuel Falvo II, RISC-V ISA Dev
On Sat, Sep 24, 2016 at 7:22 PM, Mike Hamburg <mi...@shiftleft.org> wrote:
> If users can’t rely on constant-time MUL, another option is to figure out a
> class of inputs for which MUL is constant-time and then restrict to them.
> For example, if Rocket’s MUL always takes the full cycle count when the high
> bit is 1, we can use 31-bit arithmetic, add 0x80000000 to the inputs and
> then cancel it out in the UMULH implementation. It’s a pain, though, and
> we’d have to figure out a consistent class of “safe” MUL inputs across
> different Rocket configs, which might not be easy. Or we can pretend we’re
> German and just add lots of randomized blinding to everything.

OK, it is good to see a published crypto author on this thread.

(nit: it's MULHU, not UMULH)

Rocket has a fully pipelined double precision FMA unit, which has much
better throughput than the integer multiplier _and_ is constant time.
So when targeting Rocket specifically it's probably best to treat it
as a Pentium and do all the multiplications using the floating point
instructions.

> I’d much rather designers made CPUs that will run portable crypto
> implementations securely. This requires a non-interference property, that
> register values only propagate to architecturally visible state through
> certain channels. In particular, “data” arguments of most instructions (all
> arithmetic instructions except div/mod; slt if there’s no adc; data load and
> stored by load/store but not the address; at least the shiftee of shift
> instructions; etc)

Data point: PicoRV32 has shift instructions that have variable timing
on the shift amount. I can't name anything besides RC6 that would be
affected by that, though.

> should not flow to anything but the output of the
> instruction. This nixes ideas like a branch predictor that overzealously
> looks at data registers, or that suggestion from a while ago that SLT could
> be branch-predicted.

When I suggested earlier a way to query MUL timing, Krste Asanovic
[replied](https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/42IM6tstYOg/ePRyYIenDAAJ)
"Security needs to be handled a bit more holistically". I'm honestly
not sure how to interpret that; presumably this doesn't mean a single
extension with everything from "integer rotate" to "enclaves".

> Obviously CPU designers are free not to take these precautions, unless the
> arch spec mandates them (and it currently does not). But then any product
> using that CPU will have a latent security risk.
>
> Coming back to add with carry, the RISC-V approach works OK when you want to
> add 2 words with carry. The first adc is basically 3 instructions instead
> of 1, which is not great but might be an acceptable consequence of a RISC
> architecture. However, if you want to add 3 or more words it isn’t so nice,
> because x+y+c can wrap around to x again. So you need to replace a single
> adc with 5 instructions (add; slt; add the carry; slt; combine the carry
> flags). This means that reduced-radix code will always run better on RISC-V
> than full-radix code, since full-radix has more carry propagation.
> Reduced-radix is fine for some crypto primitives but suboptimal for others.

> Probably if anyone wants to do crypto with RISC-V and cares about its
> performance, they will need an ISA extension. Maybe we should just add
> UMAAL and be done with it. Or maybe just a 3-input, 2-output ADD variant
> would be enough to get started, in combination with a MUL/MULH that’s
> guaranteed to take constant time.

By UMAAL do you mean the ARM instruction of the same mnemonic?

You'll notice that there are no instructions in the current RISC-V
spec that write more than one X or F register, or that read more than
2 X registers or 3 F registers.

> I’d also be interested to spitball some crypto instructions for the Vector
> extension, once that’s available.

-s

Mike Hamburg

Sep 24, 2016, 11:10:46 PM
to Stefan O'Rear, Michael Clark, Samuel Falvo II, RISC-V ISA Dev

> On Sep 24, 2016, at 7:43 PM, Stefan O'Rear <sor...@gmail.com> wrote:
>
> On Sat, Sep 24, 2016 at 7:22 PM, Mike Hamburg <mi...@shiftleft.org> wrote:
>> If users can’t rely on constant-time MUL, another option is to figure out a
>> class of inputs for which MUL is constant-time and then restrict to them.
>> For example, if Rocket’s MUL always takes the full cycle count when the high
>> bit is 1, we can use 31-bit arithmetic, add 0x80000000 to the inputs and
>> then cancel it out in the UMULH implementation. It’s a pain, though, and
>> we’d have to figure out a consistent class of “safe” MUL inputs across
>> different Rocket configs, which might not be easy. Or we can pretend we’re
>> German and just add lots of randomized blinding to everything.
>
> OK, it is good to see a published crypto author on this thread.
>
> (nit: it's MULHU, not UMULH)
>
> Rocket has a fully pipelined double precision FMA unit, which has much
> better throughput than the integer multiplier _and_ is constant time.
> So when targeting Rocket specifically it's probably best to treat it
> as a Pentium and do all the multiplications using the floating point
> instructions.

Ah. Yeah, that’s a better choice for Rocket then.

>> I’d much rather designers made CPUs that will run portable crypto
>> implementations securely. This requires a non-interference property, that
>> register values only propagate to architecturally visible state through
>> certain channels. In particular, “data” arguments of most instructions (all
>> arithmetic instructions except div/mod; slt if there’s no adc; data load and
>> stored by load/store but not the address; at least the shiftee of shift
>> instructions; etc)
>
> Data point: PicoRV32 has shift instructions that have variable timing
> on the shift amount. I can't name anything besides RC6 that would be
> affected by that, though.

That’s a useful data point. But as you said, it’s not so bad if the shift amount affects timing, since it basically affects just a few niche ciphers (RC5 also I think). If the value being shifted affects timing, that’s a much bigger problem.

>> should not flow to anything but the output of the
>> instruction. This nixes ideas like a branch predictor that overzealously
>> looks at data registers, or that suggestion from a while ago that SLT could
>> be branch-predicted.
>
> When I suggested earlier a way to query MUL timing, Krste Asanovic
> [replied](https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/42IM6tstYOg/ePRyYIenDAAJ)
> "Security needs to be handled a bit more holistically". I'm honestly
> not sure how to interpret that; presumably this doesn't mean a single
extension with everything from "integer rotate" to "enclaves".

Yeah, that’s not a very helpful answer. Maybe he means that you need some kind of system-wide noninterference property instead of just “MUL takes constant time”.

>> Probably if anyone wants to do crypto with RISC-V and cares about its
>> performance, they will need an ISA extension. Maybe we should just add
>> UMAAL and be done with it. Or maybe just a 3-input, 2-output ADD variant
>> would be enough to get started, in combination with a MUL/MULH that’s
>> guaranteed to take constant time.
>
> By UMAAL do you mean the ARM instruction of the same mnemonic?

Yes. UMAAL is a bit un-RISC-y, but it is essentially the inner loop of an operand-scanned multiply, and it stands in for add with carry in a pinch.
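For reference, UMAAL's semantics are RdHi:RdLo = Rn*Rm + RdLo + RdHi, and the 64-bit result never overflows, since (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. That is exactly the inner step of an operand-scanning multiply. A quick model of my own (return-value ordering is my choice, not ARM's encoding):

```python
W = 32
MASK = (1 << W) - 1

def umaal(rdlo, rdhi, rn, rm):
    """UMAAL: multiply-accumulate-accumulate.
    Result = rn*rm + rdlo + rdhi, which always fits in 64 bits."""
    r = rn * rm + rdlo + rdhi
    return r & MASK, r >> W  # (low word, high word to carry into the next column)
```

In a schoolbook multiply loop, the high word feeds the next column's accumulate input, which is why the two accumulate operands stand in for add-with-carry.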

> You'll notice that there are no instructions in the current RISC-V
> spec that write more than one X or F register, or that read more than
> 2 X registers or 3 F registers.

Yes, I noticed that. Maybe there should be a MUL variant that’s specific to reduced radix? I think reduced radix is the only bignum technique I know that doesn’t require 3 inputs or 2 outputs for performance. I was thinking that if we were going to spec a crypto extension that exceeds the current spec’s limitations, then we might want to go whole hog and spec UMAAL.

Or maybe it would be better to make a vector extension. Do you know what the input/output constraints are likely to be on V instructions?

> -s

— Mike

Christopher Celio

Sep 24, 2016, 11:26:39 PM
to Mike Hamburg, Stefan O'Rear, Michael Clark, Samuel Falvo II, RISC-V ISA Dev

>>> Probably if anyone wants to do crypto with RISC-V and cares about its
>>> performance, they will need an ISA extension.  Maybe we should just add
>>> UMAAL and be done with it.  Or maybe just a 3-input, 2-output ADD variant
>>> would be enough to get started, in combination with a MUL/MULH that’s
>>> guaranteed to take constant time.
>>
>> By UMAAL do you mean the ARM instruction of the same mnemonic?
>
> Yes.  UMAAL is a bit un-RISC-y, but it is essentially the inner loop of an operand-scanned multiply, and it stands in for add with carry in a pinch.
>
>> You'll notice that there are no instructions in the current RISC-V
>> spec that write more than one X or F register, or that read more than
>> 2 X registers or 3 F registers.
>
> Yes, I noticed that.  Maybe there should be a MUL variant that’s specific to reduced radix?  I think reduced radix is the only bignum technique I know that doesn’t require 3 inputs or 2 outputs for performance.  I was thinking that if we were going to spec a crypto extension that exceeds the current spec’s limitations, then we might want to go whole hog and spec UMAAL.
>
> Or maybe it would be better to make a vector extension.  Do you know what the input/output constraints are likely to be on V instructions?

Anything that's two-output is going to be a nightmare for the CPU, and will almost certainly just get handled by a micro-op generator... so why burden the ISA and just have the compiler generate the RISC micro-op sequence directly?  The uarch can still use macro-op fusion if it wants to give you higher performance.  This is how the MUL/MULH combo can give you two-outputs in one "8 byte instruction", and if I'm reading UMAAL correctly, it will be a "12 byte instruction"? (even less, if using RVC).

-Chris


Michael Clark

Sep 25, 2016, 12:45:39 AM
to Christopher Celio, Mike Hamburg, Stefan O'Rear, Samuel Falvo II, RISC-V ISA Dev
On 25 Sep 2016, at 4:26 PM, Christopher Celio <ce...@eecs.berkeley.edu> wrote:


>>>> Probably if anyone wants to do crypto with RISC-V and cares about its
>>>> performance, they will need an ISA extension.  Maybe we should just add
>>>> UMAAL and be done with it.  Or maybe just a 3-input, 2-output ADD variant
>>>> would be enough to get started, in combination with a MUL/MULH that’s
>>>> guaranteed to take constant time.
>>>
>>> By UMAAL do you mean the ARM instruction of the same mnemonic?
>>
>> Yes.  UMAAL is a bit un-RISC-y, but it is essentially the inner loop of an operand-scanned multiply, and it stands in for add with carry in a pinch.
>>
>>> You'll notice that there are no instructions in the current RISC-V
>>> spec that write more than one X or F register, or that read more than
>>> 2 X registers or 3 F registers.
>>
>> Yes, I noticed that.  Maybe there should be a MUL variant that’s specific to reduced radix?  I think reduced radix is the only bignum technique I know that doesn’t require 3 inputs or 2 outputs for performance.  I was thinking that if we were going to spec a crypto extension that exceeds the current spec’s limitations, then we might want to go whole hog and spec UMAAL.
>>
>> Or maybe it would be better to make a vector extension.  Do you know what the input/output constraints are likely to be on V instructions?
>
> Anything that's two-output is going to be a nightmare for the CPU, and will almost certainly just get handled by a micro-op generator... so why burden the ISA and just have the compiler generate the RISC micro-op sequence directly?  The uarch can still use macro-op fusion if it wants to give you higher performance.  This is how the MUL/MULH combo can give you two-outputs in one "8 byte instruction", and if I'm reading UMAAL correctly, it will be a "12 byte instruction"? (even less, if using RVC).

This is a very good constraint to have as it makes dependency analysis much easier.

Likewise, the pseudo-ops are edge embeddings of existing ops that (in most cases) map to a single op. LI has a single output, although it can generate quite large sequences of instructions. The length, however, is bounded.

Under those rules, this sort of construction below still qualifies as RISC-y, as there is a single output and it uses architectural registers (I’m thinking about an RV128I ABI which stores 64-bit ‘offsets’ that are register-relative):

- LD rd, rs2(rs1)
- SD rs2, rs3(rs1)

These two constructions can be decomposed into ADD+LD and ADD+SD, so they could potentially qualify as pseudo-operations (or RVC-32). I understand they are distinct from LD and SD (the RV128I 128-bit LOAD and STORE instructions) and from C.LD and C.SD, which are relative to an immediate and have a shift. I think combining more than one ALU op may violate the RISC-y constraint.

In the meta model for the RISC-V ops, these would share a similar property to the JAL and JALR pseudos, in that the same opcode is overloaded with more than one format, and LI, that expands to a sequence. In fact JAL and JALR are currently causing a problem for the model as they are the only two RISC-V ops that “op overload” or use the same name for a pseudo operation with a different format.

The constant time property (if implemented for integer shift, mul, div) is not applicable to loads and stores as we know they can vary substantially. They match these existing criteria:

- Single output criteria
- Addressing modes are limited to load and store

register + immediate vs register + register is similar to the operands of ADD vs ADDI. The LOAD/STORE unit has an adder for the immediate, so the immediate slot could be changed for a register. It could either be a fused macro-op or an RVC-32-bit embedding that expands to two base ISA ops. The idea of expanding the spec via an extension like RVC32 is an interesting one!

The RISC criteria are likely violated by:

- accessing memory outside of LOAD / STORE / AMO / LR / SC
- operation with more than 1 output
- no repetition - e.g. looping instructions that don’t increment the program counter
- one ALU operation? (fused instructions are assumed to be one ALU operation)

AMOs have no add in their addressing mode, i.e. a simple address mode. The ADD normally used in the addressing mode is replaced with the bitwise ALU ops.

It’s a big question whether a typical RV128I ABI will store 64-bit offsets. It reminds me of x32, which uses the x86-64 set of 16 registers but with 32-bit pointers. I presume 64-bit math is fast with the x32 ABI… 

Michael.

Jacob Bachmeyer

Sep 25, 2016, 12:57:47 AM
to Mike Hamburg, Stefan O'Rear, Michael Clark, Samuel Falvo II, RISC-V ISA Dev
Mike Hamburg wrote:
> On Sep 24, 2016, at 7:43 PM, Stefan O'Rear <sor...@gmail.com> wrote:
>
>> On Sat, Sep 24, 2016 at 7:22 PM, Mike Hamburg <mi...@shiftleft.org> wrote:
>>
>>> If users can’t rely on constant-time MUL, another option is to figure out a
>>> class of inputs for which MUL is constant-time and then restrict to them.
>>> For example, if Rocket’s MUL always takes the full cycle count when the high
>>> bit is 1, we can use 31-bit arithmetic, add 0x80000000 to the inputs and
>>> then cancel it out in the UMULH implementation. It’s a pain, though, and
>>> we’d have to figure out a consistent class of “safe” MUL inputs across
>>> different Rocket configs, which might not be easy. Or we can pretend we’re
>>> German and just add lots of randomized blinding to everything.
>>>
>> OK, it is good to see a published crypto author on this thread.
>>

I second Stefan's remark. It is good to have someone around whom we can
call an expert on the matter.

I think that randomized blinding probably will be required for a generic
RISC-V cryptographic library--RISC-V tries very hard to be
microarchitecture-neutral and that may preclude even basic constant-time
primitives in the base ISA.

>> You'll notice that there are no instructions in the current RISC-V
>> spec that write more than one X or F register, or that read more than
>> 2 X registers or 3 F registers.
>>
>
> Yes, I noticed that. Maybe there should be a MUL variant that’s specific to reduced radix? I think reduced radix is the only bignum technique I know that doesn’t require 3 inputs or 2 outputs for performance. I was thinking that if we were going to spec a crypto extension that exceeds the current spec’s limitations, then we might want to go whole hog and spec UMAAL.
>

If I understand correctly, (I remember reading this somewhere) the
RISC-V project uses "2 inputs, 1 output, at most one memory access" as
the definition of "RISC instruction", to the extent that this is also a
criterion for candidates for macro-op fusion. Multiplication and
division operations appear to be "special" in this regard--the actual
arithmetic is assumed to require multiple cycles, so MULH/MUL fusion
does not reduce to a single macro-op, but fused MUL reads the other half
of the result from fused MULH instead of starting another
multiplication. I may be mistaken, but I believe that MULH/MUL fusion
also means that an implementation may write the MUL result before the
MULH result, even though the sequence is "MULH; MUL"; I do not know if
this would ever be visible, or if fused MULH/MUL is atomic.

To have good cryptographic implementations on RISC-V, I would suggest
that we concentrate on a minimal set of needed operations and work
towards getting those standardized as non-interfering, possibly in an
extension. Since RISC-V tries to be abstract and
microarchitecture-neutral, a "non-interference arithmetic" extension
might be a good approach.

> Or maybe it would be better to make a vector extension. Do you know what the input/output constraints are likely to be on V instructions?
>

The slides I have seen
(<URL:http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>)
suggest the existence of at least a vector-fused-multiply-add opcode, so
there is that. Now that I think about it, vectors like RISC-V is
pursuing seem almost ideal for implementing bignum arithmetic. The
catch, of course, is that constant-time vector operations are probably a
pipe dream, but blinding can help with that. We may want to push for a
vector-add-yielding-carry operation that produces carry-chain bits or
even a vector-add-with-carry operation that directly adds vectors as
bignums. :)

If "operand-scanned multiply" is what I know as "schoolbook
multiplication", that may actually be a plausible vector operation. :)

-- Jacob

Stefan O'Rear

Sep 25, 2016, 1:33:25 AM
to Michael Clark, Christopher Celio, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev
On Sat, Sep 24, 2016 at 9:45 PM, Michael Clark <michae...@mac.com> wrote:
> Or maybe it would be better to make a vector extension. Do you know what
> the input/output constraints are likely to be on V instructions?

Since RSA is functionally obsolete, cryptographic "bignums" are
typically 255, 521, 130, or similar bit sizes, and the bit size is
*fixed*. You are not going to use a loop, because mispredicting after
two iterations will kill you, and you are probably not going to use
variable-width vector instructions, which require a loop because you
can't compute the stride count at compile time. This is a job for the
entirely-undefined P extension, not the V extension.

Christopher Celio: I'm curious of one thing about the V extension.
How do you write a dot product? I don't see any way to efficiently do
a fold/reduce operation in the current ISA drafts.

> Under those rules, this sort of construction below still qualified as
> RISC-y, as there is a single output and it uses architectural registers (I’m
> thinking about an RV128I ABI which stores 64-bit ‘offsets’ which are
> register relative):
>
> - LD rd, rs2(rs1)
> - SD rs2, rs3(rs1)

This could be done but read ports are not cheap and making reg+reg
addressing mandatory would have been a bad idea. This was already
discussed at https://people.eecs.berkeley.edu/~waterman/papers/phd-thesis.pdf#page=34
, please do not bring it up without a good reason.

For RV128 I think we're going to see the return of "far". Most
software uses 64-bit addresses exclusively, but mmap128 returns a
"void *__far" which takes 16 bytes and can be used to access amounts
of memory far larger than your program. Note that both near and far
pointers can be handled using the existing load and store
instructions, since 64-bit values are sign extended (you will never
see "garbage in the upper half of a register" on RISC-V).

"RISCy" is not a concept which is relevant today. The architecture
landscape today is very different from the days of RISC I vs. the
68000, and the design parameters of RISC I are not important to us
today; the design parameters that _are_ important are amply discussed
in the spec and in Andrew's thesis. RISC-V is "RISC" for UCB
co-branding reasons. A lot of people seem to not understand this.

> These two constructions can be decomposed into an ADD, SD and ADD, LD so
> they could potentially qualify as pseudo operations (or RVC-32). I
> understand they are distinct from LD, SD (the RV128I 128-bit LOAD and STORE
> instructions) and C.LD and C.SD which are relative to an immediate, and have
> a shift. I think combining more than one ALU op may violate the RISC-y
> constraint.

C.LD and C.SD do not shift anything in the backend. They have a
"shift" in the interpretation of the immediate, but that's strictly an
issue for the decoder.

SD does not write anything, so a SHL+ADD+SD pseudo-instruction would
still need a write port. Chris can probably comment more on that.

See above on "RISC-y".

> In the meta model for the RISC-V ops, these would share a similar property
> to the JAL and JALR pseudos, in that the same opcode is overloaded with more
> than one format, and LI, that expands to a sequence. In fact JAL and JALR
> are currently causing a problem for the model as they are the only two
> RISC-V ops that “op overload” or use the same name for a pseudo operation
> with a different format.

I don't know what you mean by this. Neither JAL nor JALR (which are
instructions, not pseudos) shares a major opcode with anything else.
Are you maybe thinking of the J and JR pseudos?

Lots of instructions can be encoded either 2-byte or 4, LD and SD are not alone.

> The constant time property (if implemented for integer shift, mul, div) is
> not applicable to loads and stores as we know they can vary substantially.
> They match these existing criteria:

This statement is dangerously wrong. Loads and stores leak the
address but they MUST keep the value loaded or stored secret.

> - Single output criteria
> - Addressing modes are limited to load and store
>
> register + immediate vs register + register is similar to the operands of
> ADD vs ADDI. LOAD STORE unit has an adder for the immediate, so the
> immediate slot could be changed for a register. It could either be a fused
> macro-op or an RVC-32 bit embedding that expands to two base ISA ops. The
> idea of expanding the spec via an extension like RVC32 is an interesting
> idea!

Read Andrew's thesis. The design criterion here is not about ALUs.
ALUs are cheap. Read ports aren't.

> The RISC criteria is likely violated by:
>
> - accessing memory outside of LOAD / STORE / AMO / LR / SC
> - operation with more than 1 output
> - no repetition - e.g. looping instructions that don’t increment the program
> counter
> - one ALU operation? (fused instructions are assumed to be one ALU
> operation)

There are no "RISC criteria" which matter to us. "RISC criteria" died
with the 80s.

> AMOs have no add in their addressing mode i.e. simple address mode. The ADD
> normally used in the addressing mode is replaced with the bitwise ALU ops.

ADDs are not a scarce resource, again.

> It’s a big question whether a typical RV128I ABI will store 64-bit offsets.
> It reminds me of x32 which is x86-64 set of 16 registers but with 32-bit
> pointers. I presume 64-bit math is fast with the x32 ABI…

64-bit math isn't the reason you use x32 over classical multiarch.
The reason is reducing spills.

-s

Christopher Celio

Sep 25, 2016, 2:01:41 AM
to Stefan O'Rear, Michael Clark, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev
> Christopher Celio: I'm curious of one thing about the V extension.
> How do you write a dot product?  I don't see any way to efficiently do
> a fold/reduce operation in the current ISA drafts.

I'm not the V extension author, just a guy who implements the specs. Your best "view" into a possible V extension is to read Krste's PhD Thesis on implementing vector processors - it's a very good read.  I'm not aware of any V extension drafts that exist, but off the top of my head, if I were to guess, possible ways to do this are either a vector-half reduce (which requires multiple iterations) or perhaps a dedicated reduction instruction (like a vadd.reduce). Skimming Krste's thesis, section 10.1 has an interesting discussion on the pros and cons of a dedicated reduction instruction. The actual best ISA design for supporting a dot product is surprisingly non-obvious!
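The vector-half reduce idea can be sketched concretely: repeatedly add the upper half of the vector into the lower half, taking log2(n) vector adds. This is only my illustration of the guess above, not a draft V instruction:

```python
def dot_product(xs, ys):
    """Dot product as an elementwise multiply followed by
    log2(n) vector-half reductions (each modeled as one vadd)."""
    acc = [x * y for x, y in zip(xs, ys)]  # vmul
    while len(acc) > 1:
        half = (len(acc) + 1) // 2
        # vadd: fold the upper half into the lower half (pad odd lengths with 0)
        acc = [acc[i] + (acc[i + half] if i + half < len(acc) else 0)
               for i in range(half)]
    return acc[0]
```

The dedicated-reduction alternative would replace the loop with a single instruction, trading ISA complexity for fewer dependent vector ops.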


> If I understand correctly, (I remember reading this somewhere) the RISC-V project uses "2 inputs, 1 output, at most one memory access" as the definition of "RISC instruction" to the extent that this is also a criteria for candidates for macro-op fusion.  Multiplication and division operations appear to be "special" in this regard--the actual arithmetic is assumed to require multiple cycles, so MULH/MUL fusion does not reduce to a single macro-op, but fused MUL reads the other half of the result from fused MULH instead of starting another multiplication.  I may be mistaken, but I believe that MULH/MUL fusion also means that an implementation may write the MUL result before the MULH result, even though the sequence is "MULH; MUL"; I do not know if this would ever be visible, or if fused MULH/MUL is atomic.


I put out a tech report recently on RISC-V and macro-op fusion, and I loosely referenced what I thought qualifies as a RISC instruction, but I in no way speak for RISC-V.


The fused MULH/MUL would be atomic, as you would not service an interrupt until finished and no intermediate instruction would come between them. Any OoO execution must strictly obey true program dependences. 


>> Under those rules, this sort of construction below still qualified as
>> RISC-y, as there is a single output and it uses architectural registers (I’m
>> thinking about an RV128I ABI which stores 64-bit ‘offsets’ which are
>> register relative):
>>
>> - LD rd, rs2(rs1)
>> - SD rs2, rs3(rs1)

> This could be done but read ports are not cheap and making reg+reg
> addressing mandatory would have been a bad idea.  This was already
> discussed at https://people.eecs.berkeley.edu/~waterman/papers/phd-thesis.pdf#page=34
> , please do not bring it up without a good reason.

My above-mentioned tech report discusses the potential for reg+reg addressing fusion. It's "free" for loads, but requires an additional read port for stores (which can be expensive for many micro-architectures). Also agreed, Andrew's thesis is an excellent read. 


-Chris

Stefan O'Rear

Sep 25, 2016, 2:03:40 AM
to jcb6...@gmail.com, Mike Hamburg, Michael Clark, Samuel Falvo II, RISC-V ISA Dev
On Sat, Sep 24, 2016 at 9:57 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> I think that randomized blinding probably will be required for a generic
> RISC-V cryptographic library--RISC-V tries very hard to be
> microarchitecture-neutral and that may preclude even basic constant-time
> primitives in the base ISA.

I think it's quite reasonable to say that instructions from the base
"I" ISA are not allowed to cause information to flow from their
operands into "non-architectural state which might affect future
nondeterministic behavior" (so leaving a cryptographic key in an
unused slot of the physical register file is fine as long as it can
never be read by software), with the following exceptions:

* The identity of executing instructions, and the PC value
* Branch conditions of branches that cannot affect the PC (I don't
know why you would do that, but OK)
* SYSTEM instructions
* Addresses used by load and store instructions (the data loaded or
stored is still protected, though)
* Shift amounts (but the bits shifted are still protected)

I think it's quite reasonable to have that guarantee in an extension
with no instructions (look at the LR/SC forward progress guarantee for
an example) and then require that extension as part of RVG2.2 : all
known cores already implement it!

> If I understand correctly, (I remember reading this somewhere) the RISC-V
> project uses "2 inputs, 1 output, at most one memory access" as the
> definition of "RISC instruction" to the extent that this is also a criteria
> for candidates for macro-op fusion. Multiplication and division operations
> appear to be "special" in this regard--the actual arithmetic is assumed to
> require multiple cycles, so MULH/MUL fusion does not reduce to a single
> macro-op, but fused MUL reads the other half of the result from fused MULH
> instead of starting another multiplication. I may be mistaken, but I
> believe that MULH/MUL fusion also means that an implementation may write the
> MUL result before the MULH result, even though the sequence is "MULH; MUL";
> I do not know if this would ever be visible, or if fused MULH/MUL is atomic.

RISC-V is designed to be efficiently implementable on various common
microarchitectural families: in-order with bypassed pipelines,
superscalar in-order, out of order, microcoded state machines. The
requirements fall out of that; OOO designs tend to be unhappy with
anything that needs multiple write ports, read ports are a very scarce
resource, and context sensitivity is a pain all around. RISC-V was
not designed by "RISC purists", if it was it wouldn't have things like
AMOADD or FSQRT. So any argument which starts by asking "is this RISC
or not" is not logically relevant here.

> The slides I have seen
> (<URL:http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>)
> suggest the existence of at least a vector-fused-multiply-add opcode, so
> there is that. Now that I think about it, vectors like RISC-V is pursuing
> seem almost ideal for implementing bignum arithmetic. The catch, of course,
> is that constant-time vector operations are probably a pipe dream, but
> blinding can help with that. We may want to push for a
> vector-add-yielding-carry operation that produces carry-chain bits or even a
> vector-add-with-carry operation that directly adds vectors as bignums. :)

constant-time vector operations are NOT a pipe dream. First off,
remember that "constant time" applies to an instruction AND an
operand, not to the instruction by itself. It's obvious that the
vector length will affect vector timing, but the vector content
generally won't. Indeed vector code is more likely to be constant
time than the equivalent scalar code, because if you have eight values
going down a pipeline at once, the microarchitectural people are going
to make sure they all get to the end at the same time.

Basically I just want to see the first heading from
http://blog.cr.yp.to/20140517-insns.html .

> If "operand-scanned multiply" is what I know as "schoolbook multiplication",
> that may actually be a plausible vector operation. :)

I can't find "operand-scanned multiply" or any plausible variation in
the last link.

-s

Stefan O'Rear

unread,
Sep 25, 2016, 2:08:54 AM
to Christopher Celio, Michael Clark, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev
On Sat, Sep 24, 2016 at 11:01 PM, Christopher Celio
<ce...@eecs.berkeley.edu> wrote:
> I'm not the V extension author, just a guy who implements the specs. Your
> best "view" into a possible V extension is to read Krste's PhD Thesis on
> implementing vector processors - it's a very good read. I'm not aware of
> any V extension drafts that exist, but off the top of my head, if I were to
> guess, possible ways to do this are either a vector-half reduce (which
> requires multiple iterations) or perhaps a dedicated reduction instruction
> (like a vadd.reduce). Skimming Krste's thesis, section 10.1 has an
> interesting discussion on the pros and cons of a dedicated reduction
> instruction. The actual best ISA design for supporting a dot product is
> surprisingly non-obvious!

Thank you, I'll have to read that later. I found out about RISC-V this
year and didn't read anything further back than 2012. (1998, wow!)

-s

Mike Hamburg

unread,
Sep 25, 2016, 3:25:14 AM
to Stefan O'Rear, jcb6...@gmail.com, Michael Clark, Samuel Falvo II, RISC-V ISA Dev

> On Sep 24, 2016, at 11:03 PM, Stefan O'Rear <sor...@gmail.com> wrote:
>
> On Sat, Sep 24, 2016 at 9:57 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>> I think that randomized blinding probably will be required for a generic
>> RISC-V cryptographic library--RISC-V tries very hard to be
>> microarchitecture-neutral and that may preclude even basic constant-time
>> primitives in the base ISA.

Maybe, but I think that library authors will prefer floats or trying to reverse-engineer the multiplier, or something else that works but is not guaranteed by the ISA. It’s a pain to write this kind of code, and it’s really hard to determine whether your blinding is good enough. I would deploy this as a secondary defense or as a speedbump against power analysis, but I would not be confident deploying it as the only countermeasure to a software side channel.

Please understand also that if arithmetic is variable-time, then almost any off-the-shelf crypto software will appear to work, will pass test vectors and interoperate, but won’t actually be secure. OpenSSL and libcrypto, broken. LibCrypto++, broken. Curve25519-donna, broken. LibSodium, broken. LibDecaf, broken. LibTomCrypt, broken. Relic, DCLXVI, MPFQ, MIRACL, all broken. This community ought to think long and hard before setting such a big trap.

> I think it's quite reasonable to say that instructions from the base
> "I" ISA are not allowed to cause information to flow from their
> operands into "non-architectural state which might affect future
> nondeterministic behavior" (so leaving a cryptographic key in an
> unused slot of the physical register file is fine as long as it can
> never be read by software), with the following exceptions:
>
> * The identity of executing instructions, and the PC value
> * Branch conditions of branches that cannot affect the PC (I don't
> know why you would do that, but OK)
> * SYSTEM instructions
> * Addresses used by load and store instructions (the data loaded or
> stored is still protected, though)
> * Shift amounts (but the bits shifted are still protected)
>
> I think it's quite reasonable to have that guarantee in an extension
> with no instructions (look at the LR/SC forward progress guarantee for
> an example) and then require that extension as part of RVG2.2 : all
> known cores already implement it!

This sounds like a good solution to me. Maybe a future version of the M extension could have a MUL.NI instruction and friends that also have these guarantees.

I don’t care that much about performance on standard RV processors, just security. CRI will probably design and implement our own extension, and we won’t have as many constraints as for a standard extension. Other companies building embedded security cores will do the same.

The UMAAL instruction (as on ARM) is apparently outside of what can be standardized, at least for now, but it replaces 8 instructions:

MUL lo, x, y
MULHU hi, x, y
ADD lo, lo, accum1
SLT c, lo, accum1
ADD hi, hi, c
ADD lo, lo, accum2
SLT c, lo, accum2
ADD hi, hi, c

It’d be nice to reduce this cost at all, even if we can’t get it down to 1 instruction in the standard. I’m not sure how though.
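For reference, the semantics being replaced are {hi,lo} = x*y + accum1 + accum2. A C sketch of the 32-bit case (my own model, not ARM's definition) also shows why the double accumulate is safe: (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so the result never overflows two words:

```c
#include <stdint.h>

/* Hypothetical model of UMAAL: {hi,lo} = x*y + acc1 + acc2.
 * (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so no carry-out is possible. */
void umaal32(uint32_t *hi, uint32_t *lo,
             uint32_t x, uint32_t y, uint32_t acc1, uint32_t acc2)
{
    uint64_t t = (uint64_t)x * y + acc1 + acc2;
    *hi = (uint32_t)(t >> 32);
    *lo = (uint32_t)t;
}
```

That no-overflow property is what makes it so useful in a schoolbook multiply inner loop: one accumulator holds the running column sum, the other the carry word from the previous column.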

>> The slides I have seen
>> (<URL:http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>)
>> suggest the existence of at least a vector-fused-multiply-add opcode, so
>> there is that. Now that I think about it, vectors like RISC-V is pursuing
>> seem almost ideal for implementing bignum arithmetic. The catch, of course,
>> is that constant-time vector operations are probably a pipe dream, but
>> blinding can help with that. We may want to push for a
>> vector-add-yielding-carry operation that produces carry-chain bits or even a
>> vector-add-with-carry operation that directly adds vectors as bignums. :)
>
> constant-time vector operations are NOT a pipe dream. First off,
> remember that "constant time" applies to an instruction AND an
> operand, not to the instruction by itself. It's obvious that the
> vector length will affect vector timing, but the vector content
> generally won't. Indeed vector code is more likely to be constant
> time than the equivalent scalar code, because if you have eight values
> going down a pipeline at once, the microarchitectural people are going
> to make sure they all get to the end at the same time.

I agree, nobody wants to design a variable-latency vector lane if they don’t have to.

> Basically I just want to see the first heading from
> http://blog.cr.yp.to/20140517-insns.html .
>
>> If "operand-scanned multiply" is what I know as "schoolbook multiplication",
>> that may actually be a plausible vector operation. :)
>
> I can't find "operand-scanned multiply" or any plausible variation in
> the last link.
>
> -s

By operand-scanned multiply, I mean this expression of the schoolbook algorithm:

multiplicand x is a vector
multiplier y is an array of words
accum is a vector which starts at 0

for m in y:
accum = accum + x*m
shift accum down one word and write the low word out to memory

write accum out to memory

Potentially the “accum = accum + x*m” step can be a vector instruction, with some sort of carry handling but I’m not entirely sure what. I had some discussions with Watson Ladd about how to design these at CRYPTO/CHES. We didn’t get a complete proposal, but maybe we can continue discussing at the RISC-V workshop. First I’d want to get a better idea of how many ports the other vector ops are likely to use, just so I don’t get shot down again because read ports are expensive.

An operand-scanned multiply instruction can be more powerful on a µarch which has a systolic vector unit. This will be true for some RISC-V architectures but isn’t true for Intel. Even on a tiny processor where the vector unit has little to no parallelism, this sort of instruction could be a win because it reduces loop overhead, and might be able to perform some kind of carry propagation.
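To make the loop above concrete, here is a scalar C sketch of operand-scanned (schoolbook) multiplication with 32-bit words; the function name and word size are mine, purely for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* out[0..xn+yn-1] = x * y, operand-scanned order: one pass over the
 * multiplicand x per multiplier word y[j].  Little-endian word order. */
void opscan_mul(uint32_t *out, const uint32_t *x, size_t xn,
                const uint32_t *y, size_t yn)
{
    for (size_t k = 0; k < xn + yn; k++)
        out[k] = 0;
    for (size_t j = 0; j < yn; j++) {      /* "for m in y" */
        uint64_t m = y[j], carry = 0;
        for (size_t i = 0; i < xn; i++) {  /* "accum = accum + x*m" */
            uint64_t t = (uint64_t)x[i] * m + out[i + j] + carry;
            out[i + j] = (uint32_t)t;      /* low word retired to memory */
            carry = t >> 32;
        }
        out[xn + j] = (uint32_t)carry;     /* carry-out becomes top word */
    }
}
```

The inner "product + accum + carry" step is the part that could become a single vector instruction; note that t never overflows 64 bits, since (2^32-1)^2 + 2*(2^32-1) = 2^64-1.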

— Mike

Stefan O'Rear

unread,
Sep 25, 2016, 4:03:09 AM
to Mike Hamburg, jcb6...@gmail.com, Michael Clark, Samuel Falvo II, RISC-V ISA Dev
On Sun, Sep 25, 2016 at 12:25 AM, Mike Hamburg <mi...@shiftleft.org> wrote:
> I don’t care that much about performance on standard RV processors, just security. CRI will probably design and implement our own extension, and we won’t have as many constraints as for a standard extension. Other companies building embedded security cores will do the same.
>
> The UMAAL instruction (as on ARM) is apparently outside of what can be standardized, at least for now, but it replaces 8 instructions:
>
> MUL lo, x, y
> MULHU hi, x, y
> ADD lo, lo, accum1
> SLT c, lo, accum1
> ADD hi, hi, c
> ADD lo, lo, accum2
> SLT c, lo, accum2
> ADD hi, hi, c
>
> It’d be nice to reduce this cost at all, even if we can’t get it down to 1 instruction in the standard. I’m not sure how though.

Go ahead with a non-standard extension. RV64Gv2.0 is the least common
denominator; I personally expect the RISC-V ecosystem of the next ~5
years to be extremely heterogenous, with application-specific
extensions all around, until market forces tell the Foundation what to
standardize. (part of why I'd like to see prioritization on
developing our version of CPUID)

I don't expect RV64G to grow much, but I do expect to see one or more
standard extensions for "I'm already paying for a 32KB D$, I don't
care about a few thousand more gates in (one of) my ALU(s)". The "B"
extension might be one of those; it's currently treated as a
placeholder for things like POPCOUNT, but we'll see.

> By operand-scanned multiply, I mean this expression of the schoolbook algorithm:
>
> multiplicand x is a vector
> multiplier y is an array of words
> accum is a vector which starts at 0
>
> for m in y:
> accum = accum + x*m
> shift accum down one word and write the low word out to memory
>
> write accum out to memory
>
> Potentially the “accum = accum + x*m” step can be a vector instruction, with some sort of carry handling but I’m not entirely sure what. I had some discussions with Watson Ladd about how to design these at CRYPTO/CHES. We didn’t get a complete proposal, but maybe we can continue discussing at the RISC-V workshop. First I’d want to get better idea of how many ports the other vector ops are likely to use, just so I don’t get shot down again because read ports are expensive.

I think this is a case where first you should design the hardware,
then the spec. If you have a HDL RVV implementation, go ahead and add
carry support (make up a design), and then we'll be able to have an
evidence-based discussion about the ISA design.

> An operand-scanned multiply instruction can be more powerful on a µarch which has a systolic vector unit. This will be true for some RISC-V architectures but isn’t true for Intel. Even on a tiny processor where the vector unit has little to no parallelism, this sort of instruction could be a win because it reduces loop overhead, and might be able to perform some kind of carry propagation.

Yes. RISC-V is being born into an increasingly heterogeneous world;
some of that can be hidden at the architectural level, some of it
can't, and I'm not in a position to know which is which.

-s

Michael Clark

unread,
Sep 25, 2016, 3:26:58 PM
to Stefan O'Rear, Christopher Celio, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev

On 25/09/2016, at 6:33 PM, Stefan O'Rear <sor...@gmail.com> wrote:

It’s a big question whether a typical RV128I ABI will store 64-bit offsets.
It reminds me of x32, which is x86-64 with its full set of 16 registers but
with 32-bit pointers. I presume 64-bit math is fast with the x32 ABI…

64-bit math isn't the reason you use x32 over classical multiarch.
The reason is reducing spills.

The reason people use x32 is to reduce pointer size. There are no more spills on x32 than there are on x86-64, and a processor running x32 supports 64-bit registers and 64-bit math. I'll take some time to reverse 64-bit math on x32...

Michael Clark

unread,
Sep 25, 2016, 3:35:49 PM
to Stefan O'Rear, Christopher Celio, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev

On 25/09/2016, at 6:33 PM, Stefan O'Rear <sor...@gmail.com> wrote:

The constant time property (if implemented for integer shift, mul, div) is
not applicable to loads and stores as we know they can vary substantially.
They match these existing criteria:

This statement is dangerously wrong.  Loads and stores leak the
address but they MUST keep the value loaded or stored secret.

I don't know where you are reading a non-secret property into my statement.

That is exactly why I'm interested in the sharded pointer (upper 64 bits loaded in X-only code).

I was referring to the non-constant-time property of LOAD and STORE due to non-deterministic cache interactions (or worse still, interactions determined by a collocated attacker in another VM sharing the L3 cache and abusing the cache coherence policy to leak timing information).

We both know there are ways to solve the cache determinism/non-determinism problem... though they diverge from what is available on today's commodity architectures...

Michael Clark

unread,
Sep 25, 2016, 3:47:53 PM
to Stefan O'Rear, Christopher Celio, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev

On 25/09/2016, at 6:33 PM, Stefan O'Rear <sor...@gmail.com> wrote:

In the meta model for the RISC-V ops, these would share a similar property
to the JAL and JALR pseudos, in that the same opcode is overloaded with more
than one format, and LI, that expands to a sequence. In fact JAL and JALR
are currently causing a problem for the model as they are the only two
RISC-V ops that “op overload” or use the same name for a pseudo operation
with a different format.

I don't know what you mean by this.  Neither JAL nor JALR (which are
instructions, not pseudos) shares a major opcode with anything else.
Are you maybe thinking of the J and JR pseudos?

Lots of instructions can be encoded in either 2-byte or 4-byte forms; LD and SD are not alone.

Have a look at the JAL and JALR pseudos. I have them commented out in the implementation I am working on, as they are the only two operators that are overloaded with more than one format (a format with 'ra' omitted). All other pseudos with 1:1 mappings to real ops have distinct formats. The assembler needs to support overloading, i.e. mangle operand types as a suffix to opcode names to distinguish them.

I had a subtle problem related to this with the shift instructions, as they have different constraints on the immediate depending on bit width. Internally they get suffixed with their ISA membership, e.g.:

- SRLI.RV32
- SRLI.RV64
- SRLI.RV128

They have different illegal instruction semantics depending on the presence of upper 1 bits in the shift immediate.
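For instance (my own sketch of the decode rule, worth checking against the spec rather than trusting): the shamt field is effectively 5, 6, or 7 bits wide on RV32, RV64, and RV128 respectively, and any set bit above that width makes the instruction illegal:

```c
#include <stdbool.h>

/* Sketch: is a 7-bit SRLI shamt field legal on a given base ISA?
 * Bits of the field above the operand width must be zero. */
bool srli_shamt_legal(int xlen, unsigned shamt)
{
    switch (xlen) {
    case 32:  return (shamt >> 5) == 0;  /* SRLI.RV32: 5-bit shamt */
    case 64:  return (shamt >> 6) == 0;  /* SRLI.RV64: 6-bit shamt */
    case 128: return (shamt >> 7) == 0;  /* SRLI.RV128: 7-bit shamt */
    default:  return false;
    }
}
```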

This is not a deficiency in RISC-V. It's a deficiency in my attempt to model the ISA.

Michael Clark

unread,
Sep 25, 2016, 3:53:08 PM
to Stefan O'Rear, Christopher Celio, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev

On 25/09/2016, at 6:33 PM, Stefan O'Rear <sor...@gmail.com> wrote:

For RV128 I think we're going to see the return of "far".  Most
software uses 64-bit addresses exclusively, but mmap128 returns a
"void *__far" which takes 16 bytes and can be used to access amounts
of memory far larger than your program.  Note that both near and far
pointers can be handled using the existing load and store
instructions, since 64-bit values are sign extended (you will never
see "garbage in the upper half of a register" on RISC-V).

Good to know. Attributes for far pointers, and a default of compact (near) pointers. There is still the issue of the canonical reference point for near pointers, e.g. gp.

The ADD could be fused, and we could live without pseudos for SD and LD that perform a register+register ADD. However, one notes that the microarchitects are going to need to optimise for whatever comes out of the compiler.

Stefan O'Rear

unread,
Sep 25, 2016, 5:21:12 PM
to Michael Clark, Christopher Celio, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev
I said "over classical multiarch", in other words, if you're comparing
x32 to i386 (with an x86_64 kernel in either case), _the pointer size
is the same_, but x32 gives you (a) faster long long arithmetic (b)
twice as many GPRs. The latter is more important in practice.

-s

Jacob Bachmeyer

unread,
Sep 25, 2016, 9:22:56 PM
to Christopher Celio, Stefan O'Rear, Michael Clark, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev
Christopher Celio wrote:
>>> If I understand correctly, (I remember reading this somewhere) the
>>> RISC-V project uses "2 inputs, 1 output, at most one memory access"
>>> as the definition of "RISC instruction" to the extent that this is
>>> also a criteria for candidates for macro-op fusion. Multiplication
>>> and division operations appear to be "special" in this regard--the
>>> actual arithmetic is assumed to require multiple cycles, so MULH/MUL
>>> fusion does not reduce to a single macro-op, but fused MUL reads the
>>> other half of the result from fused MULH instead of starting another
>>> multiplication. I may be mistaken, but I believe that MULH/MUL
>>> fusion also means that an implementation may write the MUL result
>>> before the MULH result, even though the sequence is "MULH; MUL"; I
>>> do not know if this would ever be visible, or if fused MULH/MUL is
>>> atomic.
>
>
> I put out a tech report recently on RISC-V and macro-op fusion, and I
> loosely referenced what I thought qualifies as a RISC instruction, but
> I in no way speak for RISC-V.
>
> https://arxiv.org/abs/1607.02318
>
> The fused MULH/MUL would be atomic, as you would not service an
> interrupt until finished and no intermediate instruction would come
> between them. Any OoO execution must strictly obey true program
> dependences.

Yes, now that I look at it again, I did get "2 inputs, 1 output, at most
one memory access" from EECS-2016-130. I prefixed that statement with
"If I understand correctly" because I was not certain that I had it from
an official source. My caution was warranted--that may not actually be
the definition RISC-V uses.

Michael Clark

unread,
Sep 25, 2016, 10:36:25 PM9/25/16
to jcb6...@gmail.com, Christopher Celio, Stefan O'Rear, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev
FM[N]{ADD,SUB}.{S,D,Q} violates this constraint.

Sent from my iPhone
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

Allen Baum

unread,
Sep 26, 2016, 3:01:56 AM
to Michael Clark, jcb6...@gmail.com, Christopher Celio, Stefan O'Rear, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev
As I recall from waaaay earlier in the thread, 2 src, 1 dest was for integer ops; floating-point ops were specifically 3 src, 2 dest

-Allen

On Sep 26, 2016, at 4:36 AM, Michael Clark <michae...@mac.com> wrote:

FM[N]{ADD,SUB}.{S,D,Q} violates this constraint.

Sent from my iPhone

On 26/09/2016, at 2:22 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

Christopher Celio wrote:
If I understand correctly, (I remember reading this somewhere) the RISC-V project uses "2 inputs, 1 output, at most one memory access" as the definition of "RISC instruction" to the extent that this is also a criteria for candidates for macro-op fusion.  

Jacob Bachmeyer

unread,
Sep 26, 2016, 6:25:41 PM
to Michael Clark, Christopher Celio, Stefan O'Rear, Mike Hamburg, Samuel Falvo II, RISC-V ISA Dev
Michael Clark wrote:
> On 26/09/2016, at 2:22 PM, Jacob Bachmeyer <jcb6...@gmail.com
> [quoted text moved to bottom to ensure the conversation can continue
> without confusion]
> FM[N]{ADD,SUB}.{S,D,Q} violates this constraint.

Yes, but those are floating-point operations. The actual constraint in
RISC-V seems to be that the integer register file may not be required to
have more than two read ports and one write port. Also note that the
"R4" format those use consumes an entire major opcode for each of
FMADD/FMSUB/FNMADD/FNMSUB.

-- Jacob

Jacob Bachmeyer

unread,
Sep 26, 2016, 7:01:39 PM
to Mike Hamburg, Stefan O'Rear, Michael Clark, Samuel Falvo II, RISC-V ISA Dev
Mike Hamburg wrote:
> On Sep 24, 2016, at 11:03 PM, Stefan O'Rear <sor...@gmail.com> wrote:
>
>> On Sat, Sep 24, 2016 at 9:57 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>>> I think that randomized blinding probably will be required for a generic
>>> RISC-V cryptographic library--RISC-V tries very hard to be
>>> microarchitecture-neutral and that may preclude even basic constant-time
>>> primitives in the base ISA.
>>>
>
> Maybe, but I think that library authors will prefer floats or trying to reverse-engineer the multiplier, or something else that works but is not guaranteed by the ISA. It’s a pain to write this kind of code, and it’s really hard to determine whether your blinding is good enough. I would deploy this as a secondary defense or as a speedbump against power analysis, but I would not be confident deploying it as the only countermeasure to a software side channel.
>
> Please understand also that if arithmetic is variable-time, then almost any off-the-shelf crypto software will appear to work, will pass test vectors and interoperate, but won’t actually be secure. OpenSSL and libcrypto, broken. LibCrypto++, broken. Curve25519-donna, broken. LibSodium, broken. LibDecaf, broken. LibTomCrypt, broken. Relic, DCLXVI, MPFQ, MIRACL, all broken. This community ought to think long and hard before setting such a big trap.
>

Not just appear to work--it *will* work in the algorithmic sense, but
will leak information if an adversary can observe timing, which is far
more dangerous because there is no indication that anything is wrong.
In another thread I proposed forbidding S-mode instruction fetch from
user pages, and had to explain more-or-less this type of invisible
failure mode when a flag to relax that restriction was suggested--that
the existence of an option to do it wrong is extremely dangerous,
especially when there is no effect from doing it wrong until you get
exploited. I think I feel your pain here.

To be clear, I agree that variable-time arithmetic makes RISC-V
extremely dangerous for on-line cryptographic use and that we should
push for some kind of non-interference guarantee.

>> I think it's quite reasonable to say that instructions from the base
>> "I" ISA are not allowed to cause information to flow from their
>> operands into "non-architectural state which might affect future
>> nondeterministic behavior" (so leaving a cryptographic key in an
>> unused slot of the physical register file is fine as long as it can
>> never be read by software), with the following exceptions:
>>
>> * The identity of executing instructions, and the PC value
>> * Branch conditions of branches that cannot affect the PC (I don't
>> know why you would do that, but OK)
>> * SYSTEM instructions
>> * Addresses used by load and store instructions (the data loaded or
>> stored is still protected, though)
>> * Shift amounts (but the bits shifted are still protected)
>>
>> I think it's quite reasonable to have that guarantee in an extension
>> with no instructions (look at the LR/SC forward progress guarantee for
>> an example) and then require that extension as part of RVG2.2 : all
>> known cores already implement it!
>>
>
> This sounds like a good solution to me. Maybe a future version of the M extension could have a MUL.NI instruction and friends that also have these guarantees.
>

I agree with Stefan's suggestion in principle, and with the exceptions
listed except for allowing useless branch conditions to leak, but the
restriction as written does not seem to cover timing leaks, since the
cycle, time, and instret CSRs could be argued to be architectural state.

I believe that a non-interference requirement should additionally
mandate treating useless branches as microarchitectural NOPs. They are
NOPs, but permitting the branch condition to leak could open a door for
invisible sabotage. Branches that *do* have an effect can leak their
condition, since PC is visible and it will leak that way, but meaningful
branches must also be correct or the program will fail. Useless
branches could be inserted without being noticed, with arbitrary
conditions, presumably dependent on key bits. Permitting those to leak
opens a door for quiet subversion.

Non-interference could be an ISA extension, or be declared by some other
means, perhaps a tag in the configuration string.

> I don’t care that much about performance on standard RV processors, just security. CRI will probably design and implement our own extension, and we won’t have as many constraints as for a standard extension. Other companies building embedded security cores will do the same.
>

I get the impression that the RISC-V project envisions dedicated
cryptographic accelerators and may not understand the mistrust of such
"black boxes" in the security community.

> The UMAAL instruction (as on ARM) is apparently outside of what can be standardized, at least for now, but it replaces 8 instructions:
>
> MUL lo, x, y
> MULHU hi, x, y
> ADD lo, lo, accum1
> SLT c, lo, accum1
> ADD hi, hi, c
> ADD lo, lo, accum2
> SLT c, lo, accum2
> ADD hi, hi, c
>
> It’d be nice to reduce this cost at all, even if we can’t get it down to 1 instruction in the standard. I’m not sure how though.
>

So UMAAL is essentially "(x*y) + accum1 + accum2 --> {hi, lo}"? Can you
give me more context? Why is accum2 needed instead of a simple
fused-multiply-add?

Previous discussion on this thread about add-with-carry mentioned that
WIDE-ADD could be done using an internal carry flag and a form of
macro-op fusion, but that worked because the intermediate carry was
overwritten with the final result. I think the problem with UMAAL is
that it requires four read and two write ports on the register file,
which ARM implementations are likely to have for some of the other
complex ARM opcodes, but RISC-V tries very hard to minimize the required
number of register file ports and so far has kept them to "2 read/1
write", at which the project seems to be intent on holding the line.

>>> The slides I have seen
>>> (<URL:http://riscv.org/wp-content/uploads/2015/06/riscv-vector-workshop-june2015.pdf>)
>>> suggest the existence of at least a vector-fused-multiply-add opcode, so
>>> there is that. Now that I think about it, vectors like RISC-V is pursuing
>>> seem almost ideal for implementing bignum arithmetic. The catch, of course,
>>> is that constant-time vector operations are probably a pipe dream, but
>>> blinding can help with that. We may want to push for a
>>> vector-add-yielding-carry operation that produces carry-chain bits or even a
>>> vector-add-with-carry operation that directly adds vectors as bignums. :)
>>>
>> constant-time vector operations are NOT a pipe dream. First off,
>> remember that "constant time" applies to an instruction AND an
>> operand, not to the instruction by itself. It's obvious that the
>> vector length will affect vector timing, but the vector content
>> generally won't. Indeed vector code is more likely to be constant
>> time than the equivalent scalar code, because if you have eight values
>> going down a pipeline at once, the microarchitectural people are going
>> to make sure they all get to the end at the same time.
>>
>
> I agree, nobody wants to design a variable-latency vector lane if they don’t have to.
>

Good, for once I can be happy to be wrong.

>> Basically I just want to see the first heading from
>> http://blog.cr.yp.to/20140517-insns.html .
>>
>>> If "operand-scanned multiply" is what I know as "schoolbook multiplication",
>>> that may actually be a plausible vector operation. :)
>>>
>> I can't find "operand-scanned multiply" or any plausible variation in
>> the last link.
> By operand-scanned multiply, I mean this expression of the schoolbook algorithm:
>
> multiplicand x is a vector
> multiplier y is an array of words
> accum is a vector which starts at 0
>
> for m in y:
> accum = accum + x*m
> shift accum down one word and write the low word out to memory
>
> write accum out to memory
>
> Potentially the “accum = accum + x*m” step can be a vector instruction, with some sort of carry handling but I’m not entirely sure what. I had some discussions with Watson Ladd about how to design these at CRYPTO/CHES. We didn’t get a complete proposal, but maybe we can continue discussing at the RISC-V workshop. First I’d want to get better idea of how many ports the other vector ops are likely to use, just so I don’t get shot down again because read ports are expensive.
>
> An operand-scanned multiply instruction can be more powerful on a µarch which has a systolic vector unit. This will be true for some RISC-V architectures but isn’t true for Intel. Even on a tiny processor where the vector unit has little to no parallelism, this sort of instruction could be a win because it reduces loop overhead, and might be able to perform some kind of carry propagation.
>

The problem with the simple vector-add-bignum is that carries must
propagate between vector lanes. I have been thinking about this, and
vector-fused-multiply-add is very expensive to encode, requiring an
entire quadrant of a major opcode just to represent the operands, and
that is assuming that arithmetic mode (integer/fixed/float) is part of
the vector unit state. (Each of three inputs can be a vector register
or a scalar register, so 3*6=18 bits for input registers and 5 bits for
the destination vector comes to 23 bits out of 25 available within a
major opcode in the base 32-bit encoding.) Introducing variants of VFMA
(as a previous draft of this reply suggested) is a non-starter.

Propagating carries between vector lanes would add considerable
complexity and still leave the problem of carry-in/carry-out unsolved.
Would some form of vector-save-carry that causes a following vector
addition or vector-fused-multiply-add to save carry chain information
into another vector register perhaps help? The broad outline I am
planning to propose for the vector group to consider would have a
vector-process-carry instruction that converts a carry chain vector into
a vector of integers that can be added to the result of a vector
addition to perform (part of) a bignum addition. The
vector-process-carry instruction could have its own functional unit that
should be relatively small and the carry-chain information should fit in
8-bit elements, or possibly smaller. I have not worked out the math to
determine if carry chains could be meaningfully combined before processing.

What seems to be the current approach would allow VFMA to do an
expanding multiply, by using a vector of 64-bit elements as the
destination while the source operands have 32-bit elements. To use this
for operand-scanned multiply, a vector-reduce-bignum operation would be
needed that adds the high half of each element to the low half of the
next element, probably saving carry information between the result
elements, like any other vector addition, reducing the element size by
half, dropping the least significant element of the result (which was
the low half of the least significant element of the input and was
unchanged), and shifting the result downwards by one element, thus
opening room for the high half of the highest element to be the highest
element of the result.
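A scalar Python model of this vector-reduce-bignum (my interpretation of the description above, under the assumption that the 64-bit input elements sit at a 32-bit place-value stride, as produced by the expanding multiply):

```python
W = 32
MASK = (1 << W) - 1

def vreduce_bignum(elems64):
    """Add the high half of each 64-bit element to the low half of the
    next element, halving the element size; drop the (unchanged) lowest
    low half, shift down one element, and promote the top high half to
    be the highest result element. Carry info is saved per element."""
    lo = [e & MASK for e in elems64]
    hi = [e >> W for e in elems64]
    n = len(elems64)
    dropped = lo[0]                  # least significant element, dropped
    result, carries = [], []
    for j in range(n - 1):
        s = lo[j + 1] + hi[j]
        result.append(s & MASK)      # 32-bit result element
        carries.append(s >> W)       # saved carry information
    result.append(hi[n - 1])         # high half of the highest element
    carries.append(0)
    return dropped, result, carries
```

With this layout the value is preserved: the input equals the dropped word plus each (result, carry) pair shifted up one element.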


-- Jacob

Mike Hamburg

unread,
Sep 26, 2016, 8:16:39 PM
to jcb6...@gmail.com, Stefan O'Rear, Michael Clark, Samuel Falvo II, RISC-V ISA Dev
On Sep 26, 2016, at 4:01 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

Mike Hamburg wrote:
On Sep 24, 2016, at 11:03 PM, Stefan O'Rear <sor...@gmail.com> wrote:
I think it's quite reasonable to say that instructions from the base
"I" ISA are not allowed to cause information to flow from their
operands into "non-architectural state which might affect future
nondeterministic behavior" (so leaving a cryptographic key in an
unused slot of the physical register file is fine as long as it can
never be read by software), with the following exceptions:

* The identity of executing instructions, and the PC value
* Branch conditions of branches that cannot affect the PC (I don't
know why you would do that, but OK)
* SYSTEM instructions
* Addresses used by load and store instructions (the data loaded or
stored is still protected, though)
* Shift amounts (but the bits shifted are still protected)

I think it's quite reasonable to have that guarantee in an extension
with no instructions (look at the LR/SC forward progress guarantee for
an example) and then require that extension as part of RVG2.2 : all
known cores already implement it!
   

This sounds like a good solution to me.  Maybe a future version of the M extension could have a MUL.NI instruction and friends that also have these guarantees.
 

I agree with Stefan's suggestion in principle, and with the exceptions listed except for allowing useless branch conditions to leak, but the restriction as written does not seem to cover timing leaks, since the cycle, time, and instret CSRs could be argued to be architectural state.

Yeah, phrasing a non-interference property is going to be pretty hard in general.  But you’re right, we need to make it clear that a timing leak is a leak.

To be formal, we probably will have to specify something very complete.  For example, perhaps we will need to specify that the RISC-V processor cannot allow data to flow to externally visible state, except through [long list of allowable paths].  This will be very tricky, because of corner cases such as speculative execution, instructions fetched but not executed, and the generally complex list of things which should and shouldn't leak for security.  For example, most CSRs have side effects and so can leak, but not exception handler shadow registers.

But it would be nice to start if we had some informal guarantees, in the direction of what Stefan said with your correction.

I believe that a non-interference requirement should additionally mandate treating useless branches as microarchitectural NOPs.  Branches that *do* have an effect can leak their condition, since the PC is visible and will leak it anyway, and meaningful branches must also be correct or the program will fail.  Useless branches, by contrast, could be inserted without being noticed, with arbitrary conditions, presumably dependent on key bits; permitting those to leak opens a door for quiet subversion.

It probably isn’t feasible to prevent quiet subversion.  There are lots of ways to add code that doesn’t affect the result but does leak state.  Making useless branches into NOPs is only one way to do it, and if checking this adds a lot of overhead in hardware I don’t think it should be mandated.

For super-high-level certification, you need really strong isolation guarantees.  The simplest step would be to make branch prediction explicitly a per-thread resource, but you might also need isolated caches, timers, etc.

I get the impression that the RISC-V project envisions dedicated cryptographic accelerators and may not understand the mistrust of such "black boxes" in the security community.

Beyond the mistrust, there’s simply the problem that not everyone will port their code to the accelerator.

The UMAAL instruction (as on ARM) is apparently outside of what can be standardized, at least for now, but it replaces 8 instructions:

MUL lo, x, y
MULHU hi, x, y
ADD lo, lo, accum1
SLTU c, lo, accum1
ADD hi, hi, c
ADD lo, lo, accum2
SLTU c, lo, accum2
ADD hi, hi, c

It’d be nice to reduce this cost at all, even if we can’t get it down to 1 instruction in the standard.  I’m not sure how, though.
 

So UMAAL is essentially "(x*y) + accum1 + accum2 --> {hi, lo}"?  Can you give me more context? Why is accum2 needed instead of a simple fused-multiply-add?

For operand scanning (schoolbook by rows), in each run of the inner loop you have an N-word accum, N-word y, and 1-word x (where x is iterated in the outer loop.)

So for the first word, you want to compute x*y[0] + a[0], which is a multiply-add and will return {hi,lo}.  But for the next word, you need to compute x*y[1] + a[1] + hi, which is a multiply-add-add.  Conveniently, this cannot carry (its maximum output is exactly 2^64-1 for 32-bit words, or 2^128-1 for 64-bit words), so it returns a {hi,lo} as well, and the same pattern holds for the rest of the words.

If you extend the desired bignum operation to an N-word mul-add-add, then each individual operation will be exactly a 1->2 word mul-add-add.  This makes UMAAL a very useful instruction.  However, I have only seen it on ARM, so I don’t know if it is patented.
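A quick Python model of the UMAAL semantics and the no-carry argument (64-bit words assumed; `row_pass` is a made-up name for one operand-scanning row):

```python
W = 64
MASK = (1 << W) - 1

def umaal(x, y, a1, a2):
    """x*y + a1 + a2 -> {hi, lo}. For w-bit inputs the maximum output is
    (2^w-1)^2 + 2*(2^w-1) = 2^(2w) - 1, so it always fits two words."""
    t = x * y + a1 + a2
    return t >> W, t & MASK

def row_pass(x, y_words, a_words):
    """One inner-loop row of operand scanning: a += x*y, chaining the
    hi word into the next step's second accumulator."""
    hi, out = 0, []
    for yw, aw in zip(y_words, a_words):
        hi, lo = umaal(x, yw, aw, hi)
        out.append(lo)
    return out, hi                   # n low words plus the final hi
```

Because each step's output fits exactly in {hi, lo}, the chained hi never needs a separate carry word.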

For product scanning (schoolbook by columns instead of by rows), the inner loop is (x*y) + {hi,lo} -> {hi,lo}, which in ARM parlance would be UMLAL.  Product scanning is less regular than operand scanning, but requires fewer memory writes if your regfile isn’t big enough.   Unfortunately, UMLAL can carry, which adds more complexity.  However, it is a good match for an extension that adds eg a dedicated 3-word accumulator register, where the 3rd word absorbs the carries.  It is also a good match for reduced-radix operations, where internally the numbers are represented in eg base 2^28 instead of 2^32, and the upper bits are 0 to allow some number of carries to accumulate before dealing with them.

Previous discussion on this thread about add-with-carry mentioned that WIDE-ADD could be done using an internal carry flag and a form of macro-op fusion, but that worked because the intermediate carry was overwritten with the final result.  I think the problem with UMAAL is that it requires four read and two write ports on the register file, which ARM implementations are likely to have for some of the other complex ARM opcodes, but RISC-V tries very hard to minimize the required number of register file ports and so far has kept them to "2 read/1 write", at which the project seems to be intent on holding the line.

This is certainly a problem.

In most practical UMAAL loops, the hi will be forwarded to one of the next accumulators, and one of the x,y (let’s say x) will be repeated many times.  This allows a 2->1 pattern in the inner loop in the common case.  But it’s not clear how to implement it with macro-op fusion since it’s 8 instructions.  Perhaps there is a shorter sequence of 2->1 instructions (which might not yet be RISC-V instructions), and we could spec those and then use fusion.

The problem with the simple vector-add-bignum is that carries must propagate between vector lanes.

This is certainly a problem.  There is a legitimate discussion to be had on whether this sort of reducing operation has a place in RISC-V’s V extension.

 I have been thinking about this, and vector-fused-multiply-add is very expensive to encode, requiring an entire quadrant of a major opcode just to represent the operands, and that is assuming that arithmetic mode (integer/fixed/float) is part of the vector unit state.  (Each of three inputs can be a vector register or a scalar register, so 3*6=18 bits for input registers and 5 bits for the destination vector comes to 23 bits out of 25 available within a major opcode in the base 32-bit encoding.)  Introducing variants of VFMA (as a previous draft of this reply suggested) is a non-starter.

I’ll be impressed if the V extension avoids going to wider encodings.

Propagating carries between vector lanes would add considerable complexity and still leave the problem of carry-in/carry-out unsolved.  Would some form of vector-save-carry that causes a following vector addition or vector-fused-multiply-add to save carry chain information into another vector register perhaps help?  The broad outline I am planning to propose for the vector group to consider would have a vector-process-carry instruction that converts a carry chain vector into a vector of integers that can be added to the result of a vector addition to perform (part of) a bignum addition.  The vector-process-carry instruction could have its own functional unit that should be relatively small and the carry-chain information should fit in 8-bit elements, or possibly smaller.  I have not worked out the math to determine if carry chains could be meaningfully combined before processing.

The carries themselves can be 1-bit elements.  Then each word will need to generate a bit that says whether it is all 1s.  Computing the final carry chain is something really simple, like (carries + propagators) ^ propagators, and can be tapped out from the internal structure of an adder that computes (carries + propagators) IIRC.  Then you have to apply the result, and possibly compute a final carry out.
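One way to model that in Python: pack the per-lane carry-out bits into c and the all-ones "propagator" lanes into p, then resolve every ripple at once with (d + p) ^ p (my reconstruction of the trick being described; the hardware would tap this out of an adder rather than compute it this way):

```python
W = 64                               # bits per lane
MASK = (1 << W) - 1

def vadd_with_carry_fix(a, b, carry_in=0):
    """Lane-wise add of word vectors, then fix all inter-lane carries
    at once instead of rippling sequentially through the lanes."""
    n = len(a)
    sums = [(x + y) & MASK for x, y in zip(a, b)]
    c = sum(((x + y) >> W) << i for i, (x, y) in enumerate(zip(a, b)))
    p = sum((s == MASK) << i for i, s in enumerate(sums))   # propagators
    d = (c << 1) | carry_in          # each carry-out targets the next lane
    e = (d + p) ^ p                  # carry-in per lane, ripples resolved
    out = [(s + ((e >> i) & 1)) & MASK for i, s in enumerate(sums)]
    return out, (e >> n) & 1         # result lanes, chained carry-out
```

The addition d + p lets an incoming carry ripple through a run of all-ones lanes in a single integer add; XORing p back exposes exactly the lanes that received a carry.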

What seems to be the current approach would allow VFMA to do an expanding multiply, by using a vector of 64-bit elements as the destination while the source operands have 32-bit elements.  To use this for operand-scanned multiply, a vector-reduce-bignum operation would be needed that adds the high half of each element to the low half of the next element, probably saving carry information between the result elements, like any other vector addition, reducing the element size by half, dropping the least significant element of the result (which was the low half of the least significant element of the input and was unchanged), and shifting the result downwards by one element, thus opening room for the high half of the highest element to be the highest element of the result.


-- Jacob

Yes, something like this will be required.  It’s also possible to have one instruction compute the low half, and another the high half.  Also, it is possible to defer the carries during the loop body at the cost of more registers and bandwidth, but some kind of cross-lane access is still required.

— Mike

Ray Van De Walker

unread,
Sep 26, 2016, 8:26:11 PM
to RISC-V ISA Dev
About the cryptographic requirements:
Should they be for all RISC-V CPUs?
Or should they be an additional design criterion, say "s", indicating that a CPU is "securable"?

Samuel Falvo II

unread,
Sep 26, 2016, 8:35:48 PM
to Ray Van De Walker, RISC-V ISA Dev
On Mon, Sep 26, 2016 at 5:26 PM, Ray Van De Walker
<ray.van...@silergy.com> wrote:
> Or should they be an additional design criteria, say "s", indicating that a CPU is "securable?"

I got bit by this before; S is already reserved for the standard
supervisor/privilege ISA specification compliance.

Choose another letter. ;) Perhaps O for cryptO?

--
Samuel A. Falvo II

Stefan O'Rear

unread,
Sep 26, 2016, 8:38:30 PM
to Mike Hamburg, jcb6...@gmail.com, Michael Clark, Samuel Falvo II, RISC-V ISA Dev
On Mon, Sep 26, 2016 at 5:16 PM, Mike Hamburg <mi...@shiftleft.org> wrote:
> I’ll be impressed if the V extension avoids going to wider encodings.

Hwacha is using 64-bit instructions, so I expect RVV will as well.

-s

Stefan O'Rear

unread,
Sep 26, 2016, 8:40:46 PM
to Ray Van De Walker, RISC-V ISA Dev
On Mon, Sep 26, 2016 at 5:26 PM, Ray Van De Walker
<ray.van...@silergy.com> wrote:
I'm imagining they would be a new criterion, but included in "G": if
your system is intended to run generic binaries (it either has, or is
prepared to emulate, an FPU, integer multiply/divide, and atomics), it
should have the minimal side-channel protection required by generic
implementations of cryptoprimitives.

-s

Jacob Bachmeyer

unread,
Sep 26, 2016, 11:11:22 PM
to Stefan O'Rear, Ray Van De Walker, RISC-V ISA Dev
I agree with this--current practice is to perform cryptography on
general-purpose machines. RISC-V should not invisibly break security in
the common case. "G" really should be a promise that the system is
appropriate for general use. Ultra-tiny embedded systems need not be
held to the same standard.

-- Jacob

Jacob Bachmeyer

unread,
Sep 27, 2016, 8:36:59 PM
to Mike Hamburg, Stefan O'Rear, Michael Clark, Samuel Falvo II, RISC-V ISA Dev
Mike Hamburg wrote:
You are correct in the general case, but useless branches as an
exception to a general non-interference rule just struck me as highly
odd, unlikely to help an implementation, and almost tailor-made for
quiet subversion. On the other hand, I believe that they were honestly
included in the list, since meaningful branches *do* leak their
conditionals, and having the same opcode be permitted to leak state in
one case and required to be leak-proof in another case also seems very
odd until subversion is considered.

> Making useless branches into NOPs is only one way to do it, and if
> checking this adds a lot of overhead in hardware I don’t think it
> should be mandated.

A useless branch can be recognized by the instruction decoder: the
offset is the length of the branch instruction. (RISC-V user spec v2.1;
pages 16, 17, 85; PDF pages 25, 26, 95) I agree that requiring them to
be treated as NOPs may be a bit too strong, but state details that do
not affect control flow should generally not be allowed to leak. After
all, the inputs to a multiplication are "state details that do not
affect control flow". :)

> For super-high-level certification, you need really strong isolation
> guarantees. The simplest step would be to make branch prediction
> explicitly a per-thread resource, but you might also need isolated
> caches, timers, etc.

If that level of isolation guarantee is as detrimental to performance as
it looks, it should be an extension and should not be rolled into "G".
On the other hand, "G" should be "safe enough" for a network-connected
workstation in a reasonably secure (attacker has no physical access)
environment.

>> I get the impression that the RISC-V project envisions dedicated
>> cryptographic accelerators and may not understand the mistrust of
>> such "black boxes" in the security community.
>
> Beyond the mistrust, there’s simply the problem that not everyone will
> port their code to the accelerator.

And the problem that different implementations will almost certainly
have different accelerators, which makes porting even less likely.
Patents expire, but "infeasible to implement on any other architecture"
could be forever. :)

> For product scanning (schoolbook by columns instead of by rows), the
> inner loop is (x*y) + {hi,lo} -> {hi,lo}, which in ARM parlance would
> be UMLAL. Product scanning is less regular than operand scanning, but
> requires fewer memory writes if your regfile isn’t big enough.
> Unfortunately, UMLAL can carry, which adds more complexity. However,
> it is a good match for an extension that adds eg a dedicated 3-word
> accumulator register, where the 3rd word absorbs the carries. It is
> also a good match for reduced-radix operations, where internally the
> numbers are represented in eg base 2^28 instead of 2^32, and the upper
> bits are 0 to allow some number of carries to accumulate before
> dealing with them.

Product-scanned multiply only needs multiply-and-accumulate, which is
common in DSPs, with a wide accumulator. So a RISC-V DSP extension
could support this.

>> Previous discussion on this thread about add-with-carry mentioned
>> that WIDE-ADD could be done using an internal carry flag and a form
>> of macro-op fusion, but that worked because the intermediate carry
>> was overwritten with the final result. I think the problem with
>> UMAAL is that it requires four read and two write ports on the
>> register file, which ARM implementations are likely to have for some
>> of the other complex ARM opcodes, but RISC-V tries very hard to
>> minimize the required number of register file ports and so far has
>> kept them to "2 read/1 write", at which the project seems to be
>> intent on holding the line.
>
> This is certainly a problem.
>
> In most practical UMAAL loops, the hi will be forwarded to one of the
> next accumulators, and one of the x,y (let’s say x) will be repeated
> many times. This allows a 2->1 pattern in the inner loop in the
> common case. But it’s not clear how to implement it with macro-op
> fusion since it’s 8 instructions. Perhaps there is a shorter sequence
> of 2->1 instructions (which might not yet be RISC-V instructions), and
> we could spec those and then use fusion.

If the carries can be clobbered, macro-op fusion could reduce your 8
instructions to "FUSED-MUL; WIDE-ADD; WIDE-ADD". Since the WIDE-ADDs
would clobber the result of the FUSED-MUL, the overall sequence need
only do two regfile writes, if {hi, lo} can be kept "up in the air"
inside the ALU. But this requires getting rid of "c" somehow.
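Treating {hi, lo} as a single 2w-bit quantity kept "up in the air" makes the fused sequence easy to model (FUSED-MUL and WIDE-ADD are the hypothetical fused ops from the discussion, not real RISC-V instructions):

```python
W = 64
MASK2 = (1 << (2 * W)) - 1

def fused_mul(x, y):
    return (x * y) & MASK2           # FUSED-MUL: {hi, lo} <- x * y

def wide_add(acc, addend):
    return (acc + addend) & MASK2    # WIDE-ADD: {hi, lo} += addend

# Even in the worst case, x*y + accum1 + accum2 <= 2^(2w) - 1, so no
# carry ever escapes the internal {hi, lo} pair -- "c" disappears.
x = y = accum1 = accum2 = (1 << W) - 1
t = wide_add(wide_add(fused_mul(x, y), accum1), accum2)
assert t == x * y + accum1 + accum2
```

This is why the intermediate SLTU/carry registers of the 8-instruction sequence can be absorbed by fusion: they only ever feed the next WIDE-ADD.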

>> The problem with the simple vector-add-bignum is that carries must
>> propagate between vector lanes.
>
> This is certainly a problem. There is a legitimate discussion to be
> had on whether this sort of reducing operation has a place in RISC-V’s
> V extension.

Rats: Reducing operations (like the vector-bignum-reduce I suggested)
require entire operands to be passed between lanes, not just carries. A
vector-rotate-elements operation also requires entire values be passed
between lanes. The only way in which passing values is easier than
propagating carries is that the dependency chain is the single preceding
element rather than all preceding elements.

Double rats: Unless you know some trick that I do not, I think that
reducing operations are also necessary for efficient bignum arithmetic
using vectors. (I am talking about the case where the operands are much
larger than even the vector registers.)

>> I have been thinking about this, and vector-fused-multiply-add is
>> very expensive to encode, requiring an entire quadrant of a major
>> opcode just to represent the operands, and that is assuming that
>> arithmetic mode (integer/fixed/float) is part of the vector unit
>> state. (Each of three inputs can be a vector register or a scalar
>> register, so 3*6=18 bits for input registers and 5 bits for the
>> destination vector comes to 23 bits out of 25 available within a
>> major opcode in the base 32-bit encoding.) Introducing variants of
>> VFMA (as a previous draft of this reply suggested) is a non-starter.
>
> I’ll be impressed if the V extension avoids going to wider encodings.

The second slide in that PDF (riscv-vector-workshop-june2015.pdf) lists
one of the goals for V as: "Fit into standard fixed 32-bit encoding space".

>> Propagating carries between vector lanes would add considerable
>> complexity and still leave the problem of carry-in/carry-out
>> unsolved. Would some form of vector-save-carry that causes a
>> following vector addition or vector-fused-multiply-add to save carry
>> chain information into another vector register perhaps help? The
>> broad outline I am planning to propose for the vector group to
>> consider would have a vector-process-carry instruction that converts
>> a carry chain vector into a vector of integers that can be added to
>> the result of a vector addition to perform (part of) a bignum
>> addition. The vector-process-carry instruction could have its own
>> functional unit that should be relatively small and the carry-chain
>> information should fit in 8-bit elements, or possibly smaller. I
>> have not worked out the math to determine if carry chains could be
>> meaningfully combined before processing.
>
> The carries themselves can be 1-bit elements. Then each word will
> need to generate a bit that says whether it is all 1s. Computing the
> final carry chain is something really simple, like (carries +
> propagators) ^ propagators, and can be tapped out from the internal
> structure of an adder that computes (carries + propagators) IIRC.
> Then you have to apply the result, and possibly compute a final carry
> out.

While chaining multiple additions before adding carries is of little, if
any, benefit to operand-scanned multiply, I hate the idea of wasting
bits: using only 2 bits when the smallest vector elements are 8 bits
wide seems wasteful.  I have been thinking about this, and I think I
have a solution: consider (vector unit) addition as base-(2^N) rather than
base-2. The carry chain information ("carry state vector") is held in
an ordinary vector register and consists of 8-bit elements, each holding
two 4-bit fields. One field is the accumulated count of carry-outs from
this lane and the other is the number of carry-ins that this lane can
accept before producing a new carry-out. Vector addition increments the
carry-out-count every time a carry out of this lane is produced, while
simply overwriting the carry-in-limit with a value dependent on the
result in this lane.

Some examples using base-10 and "schoolbook addition" may help to explain:

  11         1
  99         9
+ 99       + 9
----       ---
 198        18     Familiar enough?

Now with columns as vector lanes (and no carry between lanes during addition):

   9
+  9
----
   8    carry: (1,1),[0] -> 10
+ 10
----
  18

   99
+  99
-----
   88    carry: (1,1),(1,1),[0] -> 110
+ 110
-----
  198    == 2 * 99

   99
+  99
-----
   88    carry: (1,1),(1,1)
+  99
-----
   77    carry: (2,2),(2,2),[0] -> 220
+ 220
-----
  297    == 3 * 99

   99
+  99
-----
   88    carry: (1,1),(1,1)
+  99
-----
   77    carry: (2,2),(2,2)
+   3
-----
   70    carry: (2,2),(3,9),[0] -> 330
+ 330
-----
  300    == 3 * 99 + 3


The result of vector-process-carry is indicated using "->" in the
examples above. The concept here is that vector-process-carry can have
its own functional unit in the vector subsystem, avoiding the need for
the primary vector lanes to handle inter-lane carries. I also expect
that vector-process-carry would use scalar registers as the low-order
carry-in and the high-order carry-out. The lack of a carry-in was
represented by the "[0]" elements when carries were processed in the
examples above.

Using 4-bit carry accumulators requires that carries be processed after
at most 14 (or maybe 15) additions, but it also ensures that processing
carries can never produce a result that would need further carries. The
carry processing element can do everything in parallel--each lane's
carry-in-limit is compared with its carry-in value and an additional
carry-out produced if needed. The result of the calculation using
element N is placed in element N+1. The calculation is: E_{i+1} = C_i
+ ((C_{i-1} > L_i) ? 1 : 0) where E is the result vector, C is the
carry-out field of an element of the carry state vector, L is the
carry-in-limit field of an element of the carry state vector, and the
subscripts denote vector elements. Element 0 in the result is the
chained carry-in from a scalar register, while the highest element of
the result (which does not fit in the vector) is the chained carry-out
and is placed into a scalar register. To save encoding space,
vector-process-carry could use a single scalar register for both
carry-in and carry-out.
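A Python simulation of the scheme, implementing the E_{i+1} formula as written and replaying the last base-10 example above (my interpretation of the proposal, not a spec):

```python
BASE = 10      # base-10 lanes to match the examples; 2^N in hardware

def vadd(acc, addend, counts):
    """Lane-wise vector addition with no inter-lane carry: each lane
    accumulates its carry-out count and records the number of
    carry-ins it can still absorb (the carry-in limit)."""
    result, limits = [], []
    for i, (a, b) in enumerate(zip(acc, addend)):
        s = a + b
        counts[i] += s // BASE
        result.append(s % BASE)
        limits.append(BASE - 1 - s % BASE)
    return result, limits

def vprocess_carry(counts, limits, carry_in=0):
    """E[i+1] = C[i] + (C[i-1] > L[i] ? 1 : 0); element 0 is the chained
    carry-in, the element past the top is the chained carry-out."""
    n = len(counts)
    E = [carry_in]
    for i in range(n):
        prev = carry_in if i == 0 else counts[i - 1]
        E.append(counts[i] + (1 if prev > limits[i] else 0))
    return E[:n], E[n]

# Replay 99 + 99 + 99 + 3 on three lanes (lane 0 is least significant).
acc, counts = [0, 0, 0], [0, 0, 0]
for addend in ([9, 9, 0], [9, 9, 0], [9, 9, 0], [3, 0, 0]):
    acc, limits = vadd(acc, addend, counts)
fix, cout = vprocess_carry(counts, limits)   # fix == [0, 3, 3], "330"
acc, _ = vadd(acc, fix, [0, 0, 0])           # acc == [0, 0, 3], "300"
```

Note that the final fix-up addition is also lane-wise: the (C_{i-1} > L_i) term has already accounted for the lane that wraps, which is why the result needs no further carries.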

While the arithmetic done in the carry processing element could be
integrated into the vector lanes with a vector-add-bignum operation, it
supports a relatively specialized application and the need for carry
chaining between additions means that vector-add-bignum would be a
4-operand instruction. While the enormous encoding cost of a 4-operand
instruction can be justified for VFMA, it cannot be justified for the
less widely useful vector-add-bignum. Further, the need for handling
carries would not apply only to vector-add-bignum, but also to VFMA when
applied to bignums, turning bignum-VFMA into a 5-operand instruction,
which simply does not fit in the available encoding.

The solution suggested with vector-save-carry/vector-process-carry is to
add architectural state to the vector unit that provides an implicit
extra operand to vector addition, including the addition inside VFMA.
The added field designates a vector register where the carry state
vector is stored and is reset after every addition. The intent is for
implementations to use macro-op fusion to combine vector-save-carry with
the following vector addition. An alternative, if the carry chain logic
is integrated into the vector lanes, would be for this additional state
to instead reference the scalar register used for chaining carries
between vectors, but an implicit vector operand is strange enough;
vector operations that implicitly use a scalar register could be even
more confusing.

>> What seems to be the current approach would allow VFMA to do an
>> expanding multiply, by using a vector of 64-bit elements as the
>> destination while the source operands have 32-bit elements. To use
>> this for operand-scanned multiply, a vector-reduce-bignum operation
>> would be needed that adds the high half of each element to the low
>> half of the next element, probably saving carry information between
>> the result elements, like any other vector addition, reducing the
>> element size by half, dropping the least significant element of the
>> result (which was the low half of the least significant element of
>> the input and was unchanged), and shifting the result downwards by
>> one element, thus opening room for the high half of the highest
>> element to be the highest element of the result.
>
> Yes, something like this will be required. It’s also possible to have
> one instruction compute the low half, and another the high half.
> Also, it is possible to defer the carries during the loop body at the
> cost of more registers and bandwidth, but some kind of cross-lane
> access is still required.

Since vector element widths are bound to vector register numbers, the
expanding multiply has zero encoding cost. Variants of VFMA for
high-half/low-half are not going to fit in the instruction space.

If vector-reduce-bignum is implemented by passing the low halves down,
the only required inter-lane traffic is one input operand (the low half
of the next element).


-- Jacob