This had been sitting forgotten in my "Drafts" folder since late
October 2016. (oops!)
I believe that my work on encoding VFMA and "packing" other basic
operations (predicate ops, VADD, VMUL, VSUB, VCOPY; VSPLAT is a special
case of VMUL) around it is most likely to be useful. The vector
configuration that I propose is probably inferior to RVV0.2, but an
explicit VSETVL rather than a magic CSR might be useful.
Since "V" is now very much on its way to standardization, but still
lacks an instruction encoding, I decided to just send this now:
[Feel free to use any pieces of this that appear useful.]
[This was written before mbadaddr was changed to mtval. I have decided
not to edit this unless asked.]
A recent discussion on this list got me thinking about a possible
encoding for the "V" extension. Consider this proposal to be a very
rough and incomplete draft. I will make a more concrete proposal if
there is interest. This proposal is derived from the slides presented
at the 2nd RISC-V workshop, my (admittedly limited) personal experience
with SIMD programming and some past discussions on this list. The
vector instruction list is derived from Hwacha and previous discussions
on this list, but the execution model is my own doing and is different
from Hwacha and more closely follows the traditional vector model. Any
holes in it are most definitely mine. :)
It may be worth noting that I independently arrived at the concept of
using the funct3 field to select between vector and scalar operands
before reading the Hwacha manual. I think that this supports the notion
that this use of funct3 for vector instructions is a good idea.
=Vector Execution and Memory Models=
The vector unit executes instructions that are fed to it by the scalar
unit. The vector unit may use an instruction queue, permitting vector
operations to be queued and scalar execution to immediately continue.
As a result, the vector unit may lag behind the scalar unit by up to an
implementation-defined queue depth. The vector unit is not permitted to
execute instructions ahead of the scalar unit's program counter.
The vector unit, when active, has two halt states, "queue empty", which
is the normal idle state, and "execution pending", which is the state of
the vector unit when the vector unit causes a fault or is interrupted by
privileged code. User mode never sees "execution pending", since vector
execution automatically resumes upon return to user mode. Once stopped,
either by a fault or explicit privileged VSYNC, the vector unit remains
halted until scalar execution returns to user mode. (Note that this
implies that the vector extension requires U-mode to be implemented.
This will probably need to be changed.)
The vector unit uses virtual addresses in exactly the same way as the
scalar unit with identical translations. Vector loads and stores may
lag behind scalar loads and stores until one of the vector
synchronization instructions (VSYNC, VFENCE, VFENCE.IN, VFENCE.OUT) is
executed. Vector accesses see other vector accesses in program order,
but, due to "vector lag", may observe scalar stores "from the future".
Similarly, "vector lag" may cause scalar loads to observe old values
instead of the result of a pending vector store. The vector
synchronization instructions prevent these irregularities. The base ISA
FENCE instructions ensure that the scalar unit's view of memory is
consistent with other harts, but have no direct effect on the vector
unit. Ensuring that the results of a vector calculation are visible to
other harts requires a "VFENCE.OUT; FENCE" sequence.
The exact order in which elements in non-overlapping vector loads and
stores are accessed is unspecified, and multiple vector loads and stores
may be interleaved depending on the implementation, provided that vector
accesses see other vector accesses in program order. Vector accesses to
non-idempotent memory regions (such as MMIO) must be in element order
starting from the lowest element, but may have significantly reduced
performance compared with accesses to main memory. Non-overlapping
accesses to non-idempotent regions may be interleaved, depending on the
implementation.
Because "vector lag" is permitted, faults caused by the vector unit
cannot be guaranteed to be precise and two new cause codes are
introduced for "vector load access fault" and "vector store access
fault". The vector unit is required to be able to handle misaligned
loads and stores, but element accesses are not required to be atomic in
that case and performance may be significantly reduced. Misaligned
vector access to non-idempotent memory produces implementation-defined
results. An implementation that does precisely trap vector access
faults may treat vector faults as ordinary load and store faults and may
trap on misaligned vector memory accesses, with a severe loss of
performance. An imprecise vector access fault loads mbadaddr with the
address that caused the fault at the vector unit but loads mepc with
whatever subsequent instruction the scalar unit was about to begin to
execute when the vector unit raised the fault, permitting execution to
be continued.
{TODO: consider "vector status registers", one of which would contain
the address of a faulting vector instruction?}
Except for the vector synchronization instructions, all vector
instructions are "queueable" and the vector unit remembers the privilege
mode for every instruction queued. Changes in privilege mode are not
required to flush the vector instruction queue. The vector unit can
continue executing queued user operations (with user privilege) even
while the scalar unit is executing in a privileged mode. The vector
unit does not remember the values of sptbr or other
address-translation-control CSRs. A supervisor (or hypervisor or
monitor) must execute "VCXSAVE; VSYNC" during a context switch to ensure
that all pending vector operations complete before the memory subsystem
is reconfigured. A privileged VSYNC can interrupt the vector unit and
force a pending privileged VCXSAVE to be immediately executed, saving an
intermediate vector state that may include partially-executed
operations. Restoring such a state causes the vector unit to resume
execution "where it left off" upon return to user mode.
Any vector instruction may perform an implicit VSYNC or VFENCE (any
fence) depending on the implementation. An implicit VSYNC or VFENCE may
have partial effect, depending on the implementation. A vector
instruction that produces a result in a scalar register must ensure that
its result is visible to the program before scalar execution continues.
This does not require an immediate implicit VSYNC, since an
implementation is permitted to set an internal flag and wait for that
result only when the destination scalar register is read by a
subsequent instruction.
{TODO: how does this interact with context switch? VCXSAVE can
interrupt vector execution, leaving the "pending scalar result" flag
twisting in the wind}
The vector unit state can be saved and restored using VCXSAVE
("vector-context-save") and VCXRESTORE ("vector-context-restore")
instructions. These operations behave differently in user and
privileged modes. In user mode, these are queued like any other vector
operation and save/restore all user-visible vector state. This allows
the vector state to be saved and restored around functions or function
calls. In privileged modes, these operations are queued if the vector
unit is running, executed immediately if the vector unit is idle, and
executed immediately ("pulled to top-of-queue") if privileged VSYNC is
executed while queued. Privileged VCXSAVE and VCXRESTORE can
additionally save/restore the contents of the vector instruction queue
and other internal state of the vector unit not visible to user code.
The vector unit is in "execution pending" state after a privileged
VCXSAVE or VCXRESTORE. The size of a vector state object is found in
the configuration string. VCXSAVE and VCXRESTORE may have special
alignment requirements, or the vector state object may contain internal
alignment padding. Any required alignment for a vector context must be
consistent with page-alignment, but vector contexts may be aligned on
less than full pages. For portable software, a vector context must be
restored from the same page offset at which it was saved.
==Vector modifiers==
Most vector instructions can be affected by various modifiers.
Modifiers are state internal to the vector unit that are set using
special instructions in the V3-S-CONTROL-PREFIX group and implicitly
cleared after each use. Modifiers are cleared only by instructions that
actually use them; other vector operations have no effect on pending
modifiers. These "vector prefix" or "vector modifier" opcodes are
distinct instructions from the vector instructions whose execution they
affect, although software should group them to permit macro-op fusion.
Pending vector modifiers are saved and restored as part of the overall
vector unit state. Pending vector modifiers can be discarded by any
instruction encoded using VSETVL, including VGETVL, VRESET, and the
special-purpose vector-clear-modifiers encoded as "VSETVL x0, x0".
This proposal envisions predication as a vector modifier, instead of
using an implicit predicate. For programs that use the same predicates
for a series of vector operations, this is somewhat less efficient, but
for programs that use many different predicates, it saves about half of
the vector-predicate-swap instructions that would be needed. This also
allows instructions that do not need predicates to not use them,
removing the need to maintain an "all-active" predicate for this case.
Further, the vector-use-predicate opcode has space to encode an "invert
predicate" flag, as in Hwacha.
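The modifier life cycle described above can be sketched in a few lines
of Python; the class and method names here are mine, not part of the
proposal, and the semantics are my reading of the text (a modifier is
consumed only by an instruction that actually uses it):

```python
# Minimal sketch (assumed names) of the "vector modifier" state machine:
# a predicate modifier set by a prefix instruction applies to the next
# instruction that uses it, and only that instruction clears it.

class VectorModifierState:
    def __init__(self):
        self.predicate = None   # (predicate_reg, invert_flag) or None

    def set_predicate(self, preg, invert=False):
        # Models the vector-use-predicate prefix, including the
        # "invert predicate" flag mentioned above.
        self.predicate = (preg, invert)

    def take_predicate(self):
        # Called only by instructions that are actually predicated;
        # consuming the modifier clears it.
        p, self.predicate = self.predicate, None
        return p

    def clear(self):
        # Any instruction encoded using VSETVL discards pending modifiers.
        self.predicate = None

state = VectorModifierState()
state.set_predicate(3, invert=True)
# An unpredicated op or a predicate (V3-R) op leaves the modifier pending;
# the next predicated arithmetic op consumes it:
assert state.take_predicate() == (3, True)
assert state.predicate is None
```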
An additional vector modifier proposed is vector-save-carry, which
enables an ordinary vector-add to be used to implement bignum addition.
I posted details of this in a previous message. The examples are
lengthy, even though few.
=Opcode OP-VECTOR=
This proposed encoding uses a single major opcode to encode all vector
operations, dividing the major opcode into four quadrants and adapting
the R4 instruction format used for floating-point fused-multiply-add.
All vector instructions set the opcode field in bits 6:0 to OP-VECTOR
(which is to-be-defined). All vector instructions with a destination
register use the common rd field in bits 11:7 to specify the destination
register number. The common funct3 field (bits 14:12) is used to select
whether the registers referenced are in the vector or scalar register
file. Which scalar register file is used is determined by the vector
unit arithmetic mode
(integer/fixed-point-integer/fixed-point-float/floating-point) or by an
independent field in the vector unit state (to be determined). The
common rs1 (bits 19:15) field is used for the first source register
number in all vector instructions. The common rs2 field (bits 24:20)
supplies a second source register number in some vector instructions and
the rs3 field in the R4 format (bits 31:27) supplies a third source
register number in vector-fused-multiply-add and related instructions.
The fmt field in the R4 format (bits 26:25) is the "vector instruction
quadrant" field in all vector instructions. A quadrant can encode a
single fully-general 4-operand instruction, 32 to 64 3-operand
instructions, or a large number of 2-operand instructions. Unused
operand vector/scalar selection bits can be reused to distinguish
instructions, provided that all instructions so grouped place
requirements on the same operands.
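The field layout described above can be sketched as a simple packing
function. The actual OP-VECTOR opcode value is to-be-defined; the value
below is only a placeholder, and the function name is mine:

```python
# Sketch of the R4-derived field layout for OP-VECTOR instructions:
# opcode 6:0, rd 11:7, funct3 14:12, rs1 19:15, rs2 24:20,
# fmt/quadrant 26:25, rs3/funct5 31:27.

OP_VECTOR = 0x57  # placeholder only: the proposal leaves this TBD

def encode_r4(rd, funct3, rs1, rs2, fmt, rs3, opcode=OP_VECTOR):
    assert all(0 <= r < 32 for r in (rd, rs1, rs2, rs3))
    assert 0 <= funct3 < 8 and 0 <= fmt < 4
    return (rs3 << 27) | (fmt << 25) | (rs2 << 20) | (rs1 << 15) \
         | (funct3 << 12) | (rd << 7) | opcode

word = encode_r4(rd=1, funct3=0, rs1=2, rs2=3, fmt=3, rs3=4)
assert (word >> 25) & 0x3 == 3    # quadrant V3
assert (word >> 27) & 0x1f == 4   # rs3 field
```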
Operations on vector predicates are not predicated, are unconditionally
performed on all elements, and do not clear a pending predicate
modifier. Operations that do not involve addition do not use or clear a
pending save-carry modifier.
==V3==
Quadrant V3 (bits 26:25 == 2'b11) primarily encodes VFMA, which is a
4-operand instruction and fills 3/4 of the quadrant. The destination of
VFMA must be a vector, so the destination selection bit is instead
reused to indicate whether the third source operand is a scalar or a
vector. VFMA performs element-wise (rs1*rs2)+rs3 into a vector
register. The first source register is the multiplicand, the second
source is the multiplier, and the third source is the addend. Using the
x0 (but not the f0 register!) register in VFMA produces special cases:
If the addend is the x0 register, the addition is skipped and VFMA also
encodes a vector-multiply instruction. If the multiplier is the x0
register, then VFMA also encodes a vector-add instruction rs1+rs3. If
the multiplicand is the x0 register, then VFMA also encodes a
vector-subtract instruction rs3-rs2. If the multiplicand and addend are
both x0, the vector-subtract subtracts the multiplier from zero yielding
-rs2. If both factors are x0, VFMA also encodes a vector-copy
instruction that loads a vector with rs3, expanding or rounding as needed.
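The x0 special cases above amount to a small decode table. A sketch of
that mapping (the function name and the priority of the combined cases
are my own reading of the text; note that VADD and VSUB are repurposed
encodings, not literal FMA identities):

```python
# Sketch of the VFMA x0 special-casing: given which of the three source
# fields name x0, return the pseudo-instruction this encoding also names.

def vfma_alias(rs1_is_x0, rs2_is_x0, rs3_is_x0):
    if rs1_is_x0 and rs3_is_x0:
        return "VNEG"   # subtract the multiplier from zero: -rs2
    if rs1_is_x0 and rs2_is_x0:
        return "VCOPY"  # load a vector with rs3, expanding/rounding
    if rs3_is_x0:
        return "VMUL"   # addition skipped: rs1 * rs2
    if rs2_is_x0:
        return "VADD"   # repurposed encoding: rs1 + rs3
    if rs1_is_x0:
        return "VSUB"   # repurposed encoding: rs3 - rs2
    return "VFMA"       # (rs1 * rs2) + rs3

assert vfma_alias(False, False, True) == "VMUL"
assert vfma_alias(True, False, True) == "VNEG"
```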
Due to the limited number of read ports on the integer register file and
the existence of a floating-point fused-multiply-add that would make
VFMA on three scalar floating-point registers redundant, VFMA may have
at most two scalar operands. To limit decoding complexity, if rs3 is a
scalar, rs2 must be either x0 (since this encodes vector-add) or a
vector. This allows one integer read port to be assigned statically to
rs1 and the other to be used for either rs2 or rs3. Because
multiplication is commutative, this restriction does not limit the power
of VFMA. The two forbidden combinations of vector/scalar operands
instead provide encoding space for instructions that require each
particular operand to be either a vector or a scalar. When all three
operands are selected as scalar registers, the rs3 field becomes a
funct5 field, providing 32 encoding slots for 3-operand instructions.
This region is known as V3-S. The case where rs3 and rs2 are both
scalars is used for vector predicate operations, since vector predicate
operations are decoded in the vector unit. This region is known as V3-R
and is not further decoded by the scalar unit. Instead, a
vector-predicate-arithmetic symbol and the 20 bits from the rd, rs1,
rs2, and rs3 fields are simply passed to the vector unit.
This multiplexing on the combination of vector/scalar operands may be
able to reuse some of the existing decode logic associated with the
funct3 field in R, I, and S instruction formats.
A summary table of vector/scalar operand multiplexing in V3:
FMA  V*V+V
FMA  V*V+S
     !! rs3=x0 --> VMUL V*V
FMA  V*S+V
     !! rs2=x0 --> VADD V+V
XTND V*S+S --> V3-R (vector predicate arithmetic)
FMA  S*V+V
     !! rs1=x0 --> VSUB V-V
FMA  S*V+S
     !! rs1=x0 --> VSUB S-V
     !! rs3=x0 --> VMUL S*V = V*S
     !! rs1=rs3=x0 --> VNEG V
FMA  S*S+V
     !! rs1=x0 --> VSUB V-S
     !! rs2=x0 --> VADD S+V = V+S
     !! rs1=rs2=x0 --> VCOPY V
XTND S*S+S --> V3-S
     !! rs3=x0 --> VMUL S*S (due to explicit definition in V3-S)
The only operations missing are vector-add-splat and
vector-subtract-splat with two scalar inputs. Is there any reason to
have these?
===V3-R===
V3-R contains vector predicate arithmetic exactly as provided in Hwacha,
but with only the eight predicate registers planned for V. A 4-operand
vector predicate operation requires 12 bits to select three sources and
a destination and 8 bits to define the operation, for a total of 20
bits. V3-R is a 20-bit instruction space, treated as opaque in the
scalar unit. This fits exactly.
===end V3-R===
===V3-S=== (function code in rs3; operands rd, rs1, rs2)
V3-S contains various operations, and some slots are further divided
into subslots. Among its slots is V3-S-CONTROL, which contains various
vector unit control operations.
A vector-multiply-splat operation is assigned function code zero in
V3-S, since this is the regular encoding for vector-multiply with two
scalar operands (as "VFMA rd, rs1, rs2, x0"). This instruction provides
an expanding multiply on RV32GV and RV64GQV, where the vector unit can
handle elements wider than the scalar registers.
====V3-S-CONTROL==== (minor function code in rs2; operands in rd, rs1)
V3-S-CONTROL contains control operations that affect the vector unit and
is divided into 32 2-operand subslots with rs2 serving as a
function-select field. The rs1 field selects the source register and
the rd field selects the destination register.
VSETCFG-STEP is in V3-S-CONTROL and uses the rs1 and rd fields as
immediate values. VSETCFG-STEP uses rd as a base-2 logarithm to
indicate a vector element length in bits and rs1 to request a number of
architectural registers with that element length. If rs1 is zero, 32
vectors of length indicated by rd are requested. VSETCFG-STEP writes to
internal vector unit state that may be visible as CSRs and which is
checked for consistency (no more than 32 registers total requested) at
the following VSETVL or at the end of a sequence of VSETCFG-STEP. This
means that VSETVL must be used after VSETCFG to complete the vector unit
initialization and that an invalid vector unit configuration is not
required to throw an exception until VSETVL is executed. VSETCFG is an
assembler pseudo-instruction that expands into the required VSETCFG-STEP
instructions. VSETCFG-STEP instructions are only generated for element
lengths that are actually requested. Implementations may, of course,
use macro-op fusion to combine a sequence of VSETCFG-STEP into a single
VSETCFG and may trap invalid vector configurations at VSETCFG if they
can reliably detect them. After the vector unit is configured, it
cannot be configured again until either VUNCFG or VRESET is executed.
After VSETCFG, all vector registers read as zero; an implementation may
perform the actual clearing at VUNCFG or VRESET, or at any point
between VUNCFG/VRESET and the next VSETCFG.
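The deferred consistency check described above can be sketched as
follows; the class and method names are assumptions, and "nregs == 0
means 32" follows the rule stated for rs1:

```python
# Sketch of VSETCFG-STEP accumulation: each step requests `nregs`
# registers of 2**log2_bits bits each; consistency (at most 32
# architectural registers total) need not be checked until VSETVL.

class VectorConfig:
    def __init__(self):
        self.requests = []   # list of (element_bits, register_count)

    def vsetcfg_step(self, log2_bits, nregs):
        # rs1 == 0 requests 32 vectors of the given element length.
        self.requests.append((1 << log2_bits, nregs if nregs else 32))

    def vsetvl_check(self):
        # Traps only here, per the "not required to throw an exception
        # until VSETVL is executed" rule.
        if sum(n for _, n in self.requests) > 32:
            raise ValueError("more than 32 architectural registers requested")

cfg = VectorConfig()
cfg.vsetcfg_step(6, 8)    # eight 64-bit vector registers
cfg.vsetcfg_step(5, 24)   # twenty-four 32-bit vector registers
cfg.vsetvl_check()        # 8 + 24 == 32: a valid configuration
```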
VSETVL is in V3-S-CONTROL and requires two integer scalar registers as
operands. VSETVL always accesses the integer register file regardless
of the vector unit state. VSETVL reads the application-requested vector
length from the source register and sets the internal vector length to
the smaller of the requested length and the maximum supported by the
vector unit in the current configuration, placing the new vector length
into the destination register. After the vector length is changed, the
contents of the vector registers are undefined. VSETVL with the source
operand x0 does not change the vector length and instead writes the
current vector length to the destination register, encoding VGETVL.
VSETVL with a destination of x0 is used to encode zero-operand
instructions with the rs1 field serving as an additional function-select
field. This region is known as V3-S-CONTROL-VSETVL. All instructions
encoded using VSETVL, even VGETVL, clear all pending vector modifiers.
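The VSETVL/VGETVL rule above reduces to a min() over the requested
length and the configured maximum. A sketch, with x0 as the source
modeled as None (function and variable names are mine):

```python
# Sketch of VSETVL: set vl to min(requested, VLMAX) and return the new
# value; a source of x0 encodes VGETVL, which reads without changing.

def vsetvl(requested, vlmax, current_vl):
    if requested is None:        # source register x0: VGETVL
        return current_vl, current_vl
    new_vl = min(requested, vlmax)
    return new_vl, new_vl        # (new vl state, value written to rd)

vl, rd = vsetvl(100, 64, 0)      # application asks for 100 elements
assert (vl, rd) == (64, 64)      # clamped to the hardware maximum
vl, rd = vsetvl(None, 64, vl)    # VGETVL: read back, no change
assert (vl, rd) == (64, 64)
```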
=====V3-S-CONTROL-VSETVL===== (micro function code in rs1; no operands;
selected when rd is x0)
The vector-clear-modifiers instruction is assigned function code zero in
V3-S-CONTROL-VSETVL and only clears pending vector modifiers with no
other effects.
VUNCFG is in V3-S-CONTROL-VSETVL and unconfigures the vector unit,
potentially turning off power to the vector unit, depending on the
implementation. After VUNCFG, VSETCFG must be used to activate and
configure the vector unit before any other vector instructions can be used.
VRESET is in V3-S-CONTROL-VSETVL and resets the vector unit to the
initial state, but does not take any action (such as cutting power) that
would delay immediately reconfiguring the vector unit. After VRESET,
VSETCFG must be used to reconfigure the vector unit before any other
vector instructions can be used. If the vector unit is off, VRESET
activates it, but leaves it in the initial state.
TO BE DETERMINED: Should VSETCFG expand into VRESET, followed by the
needed VSETCFG-STEP instructions? This could simplify VSETCFG-STEP,
since VRESET would ensure that the vector unit is on.
The vector synchronization instructions VSYNC, VFENCE, VFENCE.IN, and
VFENCE.OUT are all in V3-S-CONTROL-VSETVL and provide varying
synchronization guarantees. Unlike other vector operations, the vector
synchronization instructions are valid even if the vector unit is off,
in which case they reduce to effective NOPs. The VFENCE.IN operation
ensures that all vector loads preceding VFENCE.IN complete before any
scalar store after VFENCE.IN begins. The VFENCE.OUT operation ensures
that all vector stores preceding VFENCE.OUT complete before any scalar
load after VFENCE.OUT begins. The VFENCE operation combines VFENCE.IN
and VFENCE.OUT and goes slightly further, ensuring that all pending
vector memory accesses complete before any subsequent scalar memory
access begins. The VSYNC operation is the strongest vector
synchronization instruction and ensures that the vector unit is idle
before scalar execution continues. Only VFENCE and VSYNC order vector
stores with respect to scalar stores.
=====end V3-S-CONTROL-VSETVL=====
A vector-save-carry operation is in V3-S-CONTROL and requires two vector
operands. The vector-save-carry operation writes to internal vector
unit state that provides implicit operands to a following vector
addition for saving inter-lane carries and updating a previously saved
carry chain. This supports using the vector unit for bignum
arithmetic. The internal state written by vector-save-carry is reset
after it is used. Implementations are expected to use macro-op fusion
to combine vector-save-carry with a following vector addition,
potentially turning VFMA into a 6-operand instruction. The
corresponding vector-process-carry that converts the carry-chain state
into a vector that can be added to the result of vector addition to
complete part of a bignum addition also requires two operands and
physically fits well in V3-S-CONTROL, but logically belongs elsewhere,
since it does not affect vector unit control state.
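The bignum use of save-carry/process-carry can be sketched as follows.
The semantics here are my reading of the description (per-element
carry-outs saved by the add, then shifted one lane and folded back in);
note that folding the carries in can itself carry, so a real bignum
routine would iterate until no carries remain:

```python
# Sketch of vector-save-carry bignum addition on 64-bit lanes, with
# limbs stored little-endian (lowest limb in element 0).

BITS = 64
MASK = (1 << BITS) - 1

def vadd_save_carry(a, b):
    # Vector add with the save-carry modifier pending: produce both the
    # per-element sums and the per-element carry-outs.
    sums   = [(x + y) & MASK for x, y in zip(a, b)]
    carrys = [(x + y) >> BITS for x, y in zip(a, b)]
    return sums, carrys

def process_carry(carrys):
    # vector-process-carry: carry-out of element i becomes the carry-in
    # of element i+1; the top element's carry-out propagates onward.
    return [0] + carrys[:-1], carrys[-1]

a = [MASK, 1]          # a = 2**64 + (2**64 - 1)... as limbs [lo, hi]
b = [1, 2]
sums, carrys = vadd_save_carry(a, b)
carry_in, carry_out = process_carry(carrys)
result = [(s + c) & MASK for s, c in zip(sums, carry_in)]
assert result == [0, 4] and carry_out == 0
```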
====end V3-S-CONTROL====
The vector-splat operation (initializing all elements of a vector from a
scalar register; a special case of vector-copy required due to the
presence of V3-S in the encoding) is a 2-operand instruction. The
assembler is expected to translate vector-copy with a scalar source to
vector-splat.
===end V3-S===
==end V3==
{TODO: comparisons?}
Quadrants V0, V1, and V2 provide encoding space for 3-operand vector
instructions. Each uses rs3/funct5 as a function code. The funct5 and
vector instruction quadrant fields may be treated as a combined funct7
field if this simplifies decoding. This region is known as VO and
contains 96 encoding slots for fully-general three-operand
instructions. Each slot may be divided among instructions that impose
common vector/scalar requirements on operands or to provide 32 subslots
for two-operand instructions.
==VO==
(These instructions are presented in the same order as in the Hwacha ISA
spec, EECS-2015-262.)
Vector memory accesses are a closely-related group filling one slot in
VO. Since vlw/vsw must have one vector operand (the data to transfer),
one scalar operand (the base address), and one operand that can be
either scalar or vector (stride or index mode), two of the funct3 bits
are available to indicate direction (load/store) and format
(integer/floating point). Vector memory accesses always use the integer
register file for their scalar inputs. Due to limited encoding space,
immediate offsets are not supported. These instructions introduce a new
assembler syntax. For unit-stride, "vlw v0, (t0)" is encoded with the
x0 register selected for the stride length or as "vlw v0, (t0)[zero]"
with an explicit stride. Similarly, to load v0 starting at t0 and
incrementing by t1, "vlw v0, (t0)[t1]" would be used. For
vector-indexed loads, the stride is a vector register, with "vlw v0,
(t0)[v4]" as an example. Vector stores are processed identically to
vector loads in the scalar unit. A constant-stride load from a register
other than x0 containing zero is a "vector-splat-from-memory" operation,
filling the target vector with repeated reads from a single address. A
constant-stride store from a register other than x0 containing zero is a
"vector-splat-to-memory" operation, sequentially writing each element to
a single address. Unless the target is in non-idempotent memory, this
"vector-splat-to-memory" can simply write the last element of the vector
or nothing at all if predicated and the predicate is "all-inactive".
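The polymorphic addressing above amounts to one address-generation rule
selected by the second operand's register file. A sketch (element size
and function name are assumptions; x0 is modeled as None):

```python
# Sketch of vlw/vsw address generation: unit stride if the offset
# register is x0, constant stride if it is a scalar, indexed if it is a
# vector.  A constant stride of zero yields the splat-from/to-memory
# behavior described above (every element hits the same address).

def vector_addresses(base, offset, vl, elem_bytes=4):
    if offset is None:                        # x0: unit stride
        return [base + i * elem_bytes for i in range(vl)]
    if isinstance(offset, int):               # scalar: constant stride
        return [base + i * offset for i in range(vl)]
    return [base + x for x in offset[:vl]]    # vector: indexed

assert vector_addresses(0x1000, None, 4) == [0x1000, 0x1004, 0x1008, 0x100c]
assert vector_addresses(0x1000, 16, 3) == [0x1000, 0x1010, 0x1020]
assert vector_addresses(0x1000, [0, 8, 4], 3) == [0x1000, 0x1008, 0x1004]
```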
Vector memory accesses with post-increment fill half of a slot in VO; the
other half is reserved. Post-increment vector memory access is
available only for unit-stride and constant-stride accesses; the half of
the slot that would contain indexed accesses is reserved until someone
comes up with a good definition of "indexed access with
post-increment". Unlike post-increment in the base ISA, vector memory
post-increment can fit into the RISC-V 2R1W register file model for both
vector-load and vector-store. The scalar unit independently calculates
the next address after the last address that will be accessed and
updates the base register. This can be another form of the vlw/vsw
instructions, as "vlw v0, (t0++)[t1]" for constant-stride post-increment
or "vlw v0, (t0++)" for unit-stride post-increment, or it can use
distinct "vlw.postinc"/"vsw.postinc" mnemonics. The existence of
post-increment loads and stores simplifies vector stripmine loops by
removing the need to explicitly calculate pointer advances. This could
also make some inner loops independent of vector element length.
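The stripmine pattern that post-increment simplifies looks like this in
scalar pseudocode (names and the doubling operation are illustrative
only; each iteration corresponds to a VSETVL plus a "vlw v0, (t0++)"):

```python
# Sketch of a stripmine loop: set vl each iteration, process vl
# elements, and let "post-increment" advance the pointer by exactly the
# number of elements accessed.

def stripmine(data, vlmax):
    out, p = [], 0
    n = len(data)
    while n > 0:
        vl = min(n, vlmax)                # VSETVL: clamp to hardware max
        chunk = data[p:p + vl]            # vlw v0, (t0++): load + advance
        out.extend(x * 2 for x in chunk)  # some vector operation
        p += vl                           # the post-increment update
        n -= vl
    return out

assert stripmine([1, 2, 3, 4, 5], vlmax=2) == [2, 4, 6, 8, 10]
```

Because the pointer advance is implicit in the access, the loop body
never mentions the element size, which is what makes it independent of
the vector element length.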
==end VO==
Some interesting encoding bits:
The basic polymorphic encoding handles vector addressing modes: unit
stride if the offset register is x0, constant stride if the offset is a
scalar, indexed if the offset is a vector. Regardless of the
normally-selected scalar register file, memory operations always use
the integer file.