Mike Hamburg wrote:
>
>> On Sep 26, 2016, at 4:01 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
You are correct in the general case, but useless branches as an
exception to a general non-interference rule just struck me as highly
odd, unlikely to help an implementation, and almost tailor-made for
quiet subversion. On the other hand, I believe that they were honestly
included in the list, since meaningful branches *do* leak their
conditionals, and having the same opcode be permitted to leak state in
one case and required to be leak-proof in another case also seems very
odd until subversion is considered.
> Making useless branches into NOPs is only one way to do it, and if
> checking this adds a lot of overhead in hardware I don’t think it
> should be mandated.
A useless branch can be recognized by the instruction decoder: its
offset equals the length of the branch instruction itself, so it targets
the next instruction whether taken or not. (RISC-V user spec v2.1;
pages 16, 17, 85; PDF pages 25, 26, 95) I agree that requiring them to
be treated as NOPs may be a bit too strong, but state details that do
not affect control flow should generally not be allowed to leak. After
all, the inputs to a multiplication are "state details that do not
affect control flow". :)
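To make the decoder check concrete, here is a minimal C sketch (function
names are mine, purely illustrative) that flags a 32-bit-encoded
conditional branch whose B-type offset equals its own length:

```c
#include <stdint.h>

/* Sketch: recognizing a "useless" branch in the 32-bit RVI encoding --
 * a conditional branch whose B-type offset is 4, its own length, so the
 * taken and not-taken paths both reach the next instruction.
 * (Compressed 2-byte branches with offset 2 would need a similar check.)
 * Field positions follow the B-type immediate layout in the user spec. */
static int32_t btype_imm(uint32_t insn) {
    uint32_t imm = ((insn >> 31) & 1u)    << 12  /* imm[12]   <- bit 31     */
                 | ((insn >> 7)  & 1u)    << 11  /* imm[11]   <- bit 7      */
                 | ((insn >> 25) & 0x3fu) << 5   /* imm[10:5] <- bits 30:25 */
                 | ((insn >> 8)  & 0xfu)  << 1;  /* imm[4:1]  <- bits 11:8  */
    return (int32_t)((imm ^ 0x1000u) - 0x1000u); /* sign-extend 13 bits */
}

static int is_useless_branch(uint32_t insn) {
    return (insn & 0x7fu) == 0x63u  /* BRANCH major opcode */
        && btype_imm(insn) == 4;    /* offset == instruction length */
}
```

The check is a handful of gates on fields the decoder already extracts,
which is why recognition itself seems cheap even if mandating NOP
behavior is too strong.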
> For super-high-level certification, you need really strong isolation
> guarantees. The simplest step would be to make branch prediction
> explicitly a per-thread resource, but you might also need isolated
> caches, timers, etc.
If that level of isolation guarantee is as detrimental to performance as
it looks, it should be an extension and should not be rolled into "G".
On the other hand, "G" should be "safe enough" for a network-connected
workstation in a reasonably secure (attacker has no physical access)
environment.
>> I get the impression that the RISC-V project envisions dedicated
>> cryptographic accelerators and may not understand the mistrust of
>> such "black boxes" in the security community.
>
> Beyond the mistrust, there’s simply the problem that not everyone will
> port their code to the accelerator.
And the problem that different implementations will almost certainly
have different accelerators, which makes porting even less likely.
Patents expire, but "infeasible to implement on any other architecture"
could be forever. :)
> For product scanning (schoolbook by columns instead of by rows), the
> inner loop is (x*y) + {hi,lo} -> {hi,lo}, which in ARM parlance would
> be UMLAL. Product scanning is less regular than operand scanning, but
> requires fewer memory writes if your regfile isn’t big enough.
> Unfortunately, UMLAL can carry, which adds more complexity. However,
> it is a good match for an extension that adds eg a dedicated 3-word
> accumulator register, where the 3rd word absorbs the carries. It is
> also a good match for reduced-radix operations, where internally the
> numbers are represented in eg base 2^28 instead of 2^32, and the upper
> bits are 0 to allow some number of carries to accumulate before
> dealing with them.
Product-scanned multiply needs only multiply-and-accumulate with a wide
accumulator, which is common in DSPs. So a RISC-V DSP extension could
support this.
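A C sketch of the column-wise loop may make the accumulator shape
concrete (function name is mine; the 3rd accumulator word absorbing
carries follows the suggestion quoted above):

```c
#include <stdint.h>

/* Product scanning (columnwise schoolbook multiply) with a 3-word
 * accumulator, as a DSP-style wide multiply-accumulate would support.
 * a, b are little-endian arrays of n 32-bit limbs; r receives 2*n limbs. */
static void mul_product_scan(const uint32_t *a, const uint32_t *b,
                             uint32_t *r, int n) {
    uint64_t acc = 0;    /* low two words of the accumulator */
    uint32_t acc_hi = 0; /* third word absorbs the carries   */
    for (int k = 0; k < 2 * n - 1; k++) {
        int lo = (k < n) ? 0 : k - n + 1;
        int hi = (k < n) ? k : n - 1;
        for (int i = lo; i <= hi; i++) {
            uint64_t p = (uint64_t)a[i] * b[k - i]; /* the x*y MAC step  */
            uint64_t t = acc + p;
            if (t < acc) acc_hi++;                  /* carry into word 3 */
            acc = t;
        }
        r[k] = (uint32_t)acc;                       /* emit one column   */
        acc = (acc >> 32) | ((uint64_t)acc_hi << 32);
        acc_hi = 0;
    }
    r[2 * n - 1] = (uint32_t)acc;
}
```

Note that each column writes exactly one result limb to memory, which is
the "fewer memory writes" advantage over operand scanning.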
>> Previous discussion on this thread about add-with-carry mentioned
>> that WIDE-ADD could be done using an internal carry flag and a form
>> of macro-op fusion, but that worked because the intermediate carry
>> was overwritten with the final result. I think the problem with
>> UMAAL is that it requires four read and two write ports on the
>> register file, which ARM implementations are likely to have for some
>> of the other complex ARM opcodes, but RISC-V tries very hard to
>> minimize the required number of register file ports and so far has
>> kept them to "2 read/1 write", at which the project seems to be
>> intent on holding the line.
>
> This is certainly a problem.
>
> In most practical UMAAL loops, the hi will be forwarded to one of the
> next accumulators, and one of the x,y (let’s say x) will be repeated
> many times. This allows a 2->1 pattern in the inner loop in the
> common case. But it’s not clear how to implement it with macro-op
> fusion since it’s 8 instructions. Perhaps there is a shorter sequence
> of 2->1 instructions (which might not yet be RISC-V instructions), and
> we could spec those and then use fusion.
If the carries can be clobbered, macro-op fusion could reduce your 8
instructions to "FUSED-MUL; WIDE-ADD; WIDE-ADD". Since the WIDE-ADDs
would clobber the result of the FUSED-MUL, the overall sequence need
only do two regfile writes, if {hi, lo} can be kept "up in the air"
inside the ALU. But this requires getting rid of "c" somehow.
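For reference, the 2->1 inner-loop step being fused here has UMAAL's
semantics, which notably can never overflow the 64-bit {hi, lo} pair, so
the value really can stay "up in the air". A one-line C sketch (function
name is mine):

```c
#include <stdint.h>

/* UMAAL semantics: {hi, lo} = x*y + hi + lo.  This cannot overflow
 * 64 bits, since (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1 exactly -- which is
 * why no "c" carry input is needed and two regfile writes suffice. */
static inline uint64_t umaal(uint32_t x, uint32_t y,
                             uint32_t hi, uint32_t lo) {
    return (uint64_t)x * y + hi + lo;
}
```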
>> The problem with the simple vector-add-bignum is that carries must
>> propagate between vector lanes.
>
> This is certainly a problem. There is a legitimate discussion to be
> had on whether this sort of reducing operation has a place in RISC-V’s
> V extension.
Rats: Reducing operations (like the vector-bignum-reduce I suggested)
require entire operands to be passed between lanes, not just carries. A
vector-rotate-elements operation also requires entire values be passed
between lanes. The only way in which passing values is easier than
propagating carries is that the dependency chain is the single preceding
element rather than all preceding elements.
Double rats: Unless you know some trick that I do not, I think that
reducing operations are also necessary for efficient bignum arithmetic
using vectors. (I am talking about the case where the operands are much
larger than even the vector registers.)
>> I have been thinking about this, and vector-fused-multiply-add is
>> very expensive to encode, requiring an entire quadrant of a major
>> opcode just to represent the operands, and that is assuming that
>> arithmetic mode (integer/fixed/float) is part of the vector unit
>> state. (Each of three inputs can be a vector register or a scalar
>> register, so 3*6=18 bits for input registers and 5 bits for the
>> destination vector comes to 23 bits out of 25 available within a
>> major opcode in the base 32-bit encoding.) Introducing variants of
>> VFMA (as a previous draft of this reply suggested) is a non-starter.
>
> I’ll be impressed if the V extension avoids going to wider encodings.
The second slide in that PDF (riscv-vector-workshop-june2015.pdf) lists
as one of the goals for V: "Fit into standard fixed 32-bit encoding
space".
>> Propagating carries between vector lanes would add considerable
>> complexity and still leave the problem of carry-in/carry-out
>> unsolved. Would some form of vector-save-carry that causes a
>> following vector addition or vector-fused-multiply-add to save carry
>> chain information into another vector register perhaps help? The
>> broad outline I am planning to propose for the vector group to
>> consider would have a vector-process-carry instruction that converts
>> a carry chain vector into a vector of integers that can be added to
>> the result of a vector addition to perform (part of) a bignum
>> addition. The vector-process-carry instruction could have its own
>> functional unit that should be relatively small and the carry-chain
>> information should fit in 8-bit elements, or possibly smaller. I
>> have not worked out the math to determine if carry chains could be
>> meaningfully combined before processing.
>
> The carries themselves can be 1-bit elements. Then each word will
> need to generate a bit that says whether it is all 1s. Computing the
> final carry chain is something really simple, like (carries +
> propagators) ^ propagators, and can be tapped out from the internal
> structure of an adder that computes (carries + propagators) IIRC.
> Then you have to apply the result, and possibly compute a final carry
> out.
While chaining multiple additions before adding carries is of little, if
any, benefit to operand-scanned multiply, using only 2 bits per lane
when the smallest vector elements are 8 bits wastes most of each
element. I have been thinking about this, and I think I have a solution:
treat (vector unit) addition as base-(2^N) rather than base-2. The
carry chain information (the "carry state vector") is held in an
ordinary vector register and consists of 8-bit elements, each holding
two 4-bit fields. One field is the accumulated count of carry-outs from
this lane; the other is the number of carry-ins that this lane can
accept before producing a new carry-out. Vector addition increments the
carry-out count every time a carry out of this lane is produced, while
simply overwriting the carry-in limit with a value dependent on the
result in this lane.
Some examples using base-10 and "schoolbook addition" may help to explain:

      11          1
      99          9
    + 99        + 9
    ----        ---
     198         18        Familiar enough?

Now with columns as vector lanes (and no carry between lanes during addition):

       9
     + 9
     ---
       8    carry: (1,1),[0] -> 10
    + 10
    ----
      18

      99
    + 99
    ----
      88    carry: (1,1),(1,1),[0] -> 110
   + 110
   -----
     198    == 2 * 99

      99
    + 99
    ----
      88    carry: (1,1),(1,1)
    + 99
    ----
      77    carry: (2,2),(2,2),[0] -> 220
   + 220
   -----
     297    == 3 * 99

      99
    + 99
    ----
      88    carry: (1,1),(1,1)
    + 99
    ----
      77    carry: (2,2),(2,2)
     + 3
    ----
      70    carry: (2,2),(3,9),[0] -> 330
   + 330
   -----
     300    == 3 * 99 + 3
The result of vector-process-carry is indicated using "->" in the
examples above. The concept here is that vector-process-carry can have
its own functional unit in the vector subsystem, avoiding the need for
the primary vector lanes to handle inter-lane carries. I also expect
that vector-process-carry would use scalar registers as the low-order
carry-in and the high-order carry-out. The lack of a carry-in was
represented by the "[0]" elements when carries were processed in the
examples above.
Using 4-bit carry accumulators requires that carries be processed after
at most 14 (or maybe 15) additions, but it also ensures that processing
carries can never produce a result that would need further carries. The
carry processing element can do everything in parallel--each lane's
carry-in-limit is compared with its carry-in value, and an additional
carry-out is produced if needed. The result of the calculation using
element N is placed in element N+1. The calculation is: E_{i+1} = C_i
+ ((C_{i-1} > L_i) ? 1 : 0) where E is the result vector, C is the
carry-out field of an element of the carry state vector, L is the
carry-in-limit field of an element of the carry state vector, and the
subscripts denote vector elements. Element 0 in the result is the
chained carry-in from a scalar register, while the highest element of
the result (which does not fit in the vector) is the chained carry-out
and is placed into a scalar register. To save encoding space,
vector-process-carry could use a single scalar register for both
carry-in and carry-out.
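As a sanity check on the scheme, here is a base-10 software model of
vector-process-carry (the function name and representation are mine, not
a proposed encoding); it reproduces the "-> 110" and "-> 330" vectors
from the worked examples above:

```c
/* Base-10 model of the proposed vector-process-carry.  C[i] is lane i's
 * accumulated carry-out count, L[i] the number of carry-ins lane i can
 * still absorb, and carry_in the scalar chained carry-in.  E receives
 * lanes+1 entries: E[0] is the carry-in, E[lanes] the scalar chained
 * carry-out.  Per the formula in the text:
 *     E_{i+1} = C_i + ((C_{i-1} > L_i) ? 1 : 0),  with C_{-1} = carry_in */
static void process_carry(const int *C, const int *L, int lanes,
                          int carry_in, int *E) {
    E[0] = carry_in;
    for (int i = 0; i < lanes; i++) {
        int in = (i == 0) ? carry_in : C[i - 1];
        E[i + 1] = C[i] + ((in > L[i]) ? 1 : 0);
    }
}
```

Every lane's output depends only on its own state and its immediate
predecessor's carry-out count, so the hardware really can compute all
lanes in parallel as claimed.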
While the arithmetic done in the carry processing element could be
integrated into the vector lanes with a vector-add-bignum operation, it
supports a relatively specialized application and the need for carry
chaining between additions means that vector-add-bignum would be a
4-operand instruction. While the enormous encoding cost of a 4-operand
instruction can be justified for VFMA, it cannot be justified for the
less widely useful vector-add-bignum. Further, the need for handling
carries would not apply only to vector-add-bignum, but also to VFMA when
applied to bignums, turning bignum-VFMA into a 5-operand instruction,
which simply does not fit in the available encoding.
The solution suggested with vector-save-carry/vector-process-carry is to
add architectural state to the vector unit that provides an implicit
extra operand to vector addition, including the addition inside VFMA.
The added field designates a vector register where the carry state
vector is stored; the field is reset after every addition. The intent
is for implementations to use macro-op fusion to combine
vector-save-carry with the following vector addition. An alternative,
if the carry chain logic is integrated into the vector lanes, would be
for this additional state to instead reference the scalar register used
for chaining carries between vectors, but having an implicit vector
operand is strange enough; vector operations implicitly using a scalar
register would be even more confusing.
>> What seems to be the current approach would allow VFMA to do an
>> expanding multiply, by using a vector of 64-bit elements as the
>> destination while the source operands have 32-bit elements. To use
>> this for operand-scanned multiply, a vector-reduce-bignum operation
>> would be needed that adds the high half of each element to the low
>> half of the next element, probably saving carry information between
>> the result elements, like any other vector addition, reducing the
>> element size by half, dropping the least significant element of the
>> result (which was the low half of the least significant element of
>> the input and was unchanged), and shifting the result downwards by
>> one element, thus opening room for the high half of the highest
>> element to be the highest element of the result.
>
> Yes, something like this will be required. It’s also possible to have
> one instruction compute the low half, and another the high half.
> Also, it is possible to defer the carries during the loop body at the
> cost of more registers and bandwidth, but some kind of cross-lane
> access is still required.
Since vector element widths are bound to vector register numbers, the
expanding multiply has zero encoding cost. Variants of VFMA for
high-half/low-half are not going to fit in the instruction space.
If vector-reduce-bignum is implemented by passing the low halves down,
the only required inter-lane traffic is one input operand (the low half
of the next element).
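A scalar C model of the proposed vector-reduce-bignum (name mine) may
clarify the element shuffle -- high halves stay put, low halves pass
down one lane, and the unchanged lowest low half is dropped; carries
between result elements are accumulated in a plain variable here rather
than a carry state vector:

```c
#include <stdint.h>

/* Model of vector-reduce-bignum: src holds n 64-bit elements (32x32
 * expanding products); dst receives n 32-bit elements where
 *     dst[i]   = hi(src[i]) + lo(src[i+1])   for i < n-1
 *     dst[n-1] = hi(src[n-1])
 * Returns the final carry-out (which the real instruction would save
 * like any other vector addition). */
static uint32_t vreduce_bignum(const uint64_t *src, uint32_t *dst, int n) {
    uint32_t carry = 0;
    for (int i = 0; i < n - 1; i++) {
        uint64_t s = (src[i] >> 32) + (uint32_t)src[i + 1] + carry;
        dst[i] = (uint32_t)s;
        carry = (uint32_t)(s >> 32);
    }
    uint64_t top = (src[n - 1] >> 32) + carry;
    dst[n - 1] = (uint32_t)top;
    return (uint32_t)(top >> 32);
}
```

As noted above, the only value crossing lanes is the low half handed to
the neighboring lane.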
-- Jacob