Public review for Non-ISA Specification: RISC-V vector intrinsic


Kito Cheng

Oct 14, 2024, 8:03:23 AM
to RISC-V ISA Dev, tech-a...@lists.riscv.org, Jeff Scheel, Yueh-Ting Chen, Roger Ferrer Ibanez, tech-rvv-...@lists.riscv.org
Hi everyone:

We are delighted to announce the start of the public review period for
RISC-V vector intrinsic specification.

The review period begins today, 2024/10/14, and ends on 2024/11/13 (inclusive).

This Non-ISA specification is described in the PDF available at:
https://github.com/riscv-non-isa/rvv-intrinsic-doc/releases/download/v1.0.0-rc4/

which was generated from the source available in the following GitHub repo:
https://github.com/riscv-non-isa/rvv-intrinsic-doc/

To respond to the public review, please either email comments to the
public isa-dev (isa...@groups.riscv.org) mailing list or add issues
and/or pull requests (PRs) to the GitHub repo
(https://github.com/riscv-non-isa/rvv-intrinsic-doc/). We welcome all
input and appreciate your time and effort in helping us by reviewing
the specification.

During the public review period, corrections, comments, and
suggestions will be gathered for review by the rvv-intrinsic Task
Group. Any minor corrections and/or uncontroversial changes will be
incorporated into the specification. Any remaining issues or proposed
changes will be addressed in the public review summary report. If
there are no issues that require incompatible changes to the public
review specification, the rvv-intrinsic Task Group will recommend the
updated specifications be approved and ratified by the RISC-V
Technical Steering Committee and the RISC-V Board of Directors.

Thanks to all the contributors for all their hard work.

Kito Cheng

Paul Clarke

Oct 14, 2024, 9:25:00 AM
to Kito Cheng, RISC-V ISA Dev, tech-a...@lists.riscv.org, tech-rvv-...@lists.riscv.org
For me, that first link (to the PDF) is a 404.

PC


Kito Cheng

Oct 14, 2024, 9:38:23 AM
to Paul Clarke, RISC-V ISA Dev, tech-a...@lists.riscv.org, tech-rvv-...@lists.riscv.org
Thanks for the notice, my bad, the correct link is here:

https://github.com/riscv-non-isa/rvv-intrinsic-doc/releases/tag/v1.0.0-rc4

BGB

Oct 14, 2024, 5:34:37 PM
to isa...@groups.riscv.org
On 10/14/2024 8:38 AM, 'Kito Cheng' via RISC-V ISA Dev wrote:
> Thank you notice, my bad, the correct link is here:
>
> https://github.com/riscv-non-isa/rvv-intrinsic-doc/releases/tag/v1.0.0-rc4
>

Skim...

So, seems sorta similar to "xmmintrin.h" and friends, but specialized
for 'V'. OK.

The downside of these types of approaches is that they are awkward to
use and difficult to make portable; in particular, code targeting one
such system is hard to port between SIMD systems.


Similarly, dealing with variable-length vectors requires a different way
of relating to vectors than is typical, in my experience.

Not that it couldn't be useful.

The reported page count for the PDF seems a bit excessive, not sure
what that is about.

Do you *really* need around 4000 pages worth of function prototypes?...





FWIW, I took a few different approaches to vectors and SIMD in my compiler:
* Approach 1:
** Leverage the existing GCC vector syntax and similar:
*** __attribute__((vector_size(16))) and similar.
*** Also allow a [vector_size(16)] shorthand.
* Approach 2:
** Glue parts of a GLSL like system onto C.

Granted, this doesn't map to variable-sized vectors; my focus and
use cases have usually involved fixed-size vectors.


But, a short summary of my compiler's vector system.

Non-Type Types:
* __m64, non-typed 64-bit entity
** casts to/from this type copy raw bits without preserving value.
* __m128, non-typed 128-bit entity
* __m32, possible 32-bit variant of the above

In BGBCC, it is possible to move values between integer, floating-point,
and vector types, by casting through the above types (as opposed to
needing unions or memcpy). Cast-conversion is generally the most
preferred way of converting values in this case (can use direct
register-to-register operations).
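For comparison, a minimal standard-C sketch of what the __m64-style cast
provides (names here are hypothetical; BGBCC does this as a direct
register move rather than going through memory):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for BGBCC's __m64: an untyped 64-bit container. */
typedef uint64_t m64_bits;

/* In standard C, the bit-preserving moves that BGBCC expresses as casts
 * through __m64 have to go through memcpy (or a union). */
static m64_bits bits_from_double(double d) {
    m64_bits b;
    memcpy(&b, &d, sizeof b);  /* copies raw bits; value is not preserved */
    return b;
}

static double double_from_bits(m64_bits b) {
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}
```

Most compilers optimize the memcpy away, but the cast form avoids even
the appearance of a memory round-trip.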


Vector-type types (floating point):
* __vec2f, 2x Binary32 (64 bit)
* __vec4f, 4x Binary32 (128-bit)
* __vec2d, 2x Binary64 (128-bit)
* __vec4d, 4x Binary64 (256-bit)
* __vec2sf / __vec2h, 2x Binary16 (32-bit)
* __vec4sf / __vec4h, 4x Binary16 (64-bit)

Vector-type types (int):
* __vec4w / __vec4sw, __vec4uw: 4x Int16
* __vec4i / __vec4si, __vec4ui: 4x Int32
** Base type: Unspecified, Modulo
** 'si': Signed, Signed Saturate
** 'ui': Unsigned, Unsigned Saturate
*** Saturating does not exist natively in my SIMD ISA, so adds cost.
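As a scalar illustration of the modulo-vs-saturate distinction above
(assuming per-element semantics like these; the helper names are made
up for this sketch):

```c
#include <stdint.h>

/* Plain (modulo) Int16 add: wraps around on overflow, matching the
 * unspecified/modulo base types. */
static int16_t add_i16_modulo(int16_t a, int16_t b) {
    return (int16_t)((uint16_t)a + (uint16_t)b);  /* wraps */
}

/* Signed-saturating add, as implied by the 'si' types: clamps to the
 * Int16 range instead of wrapping. */
static int16_t add_i16_sat(int16_t a, int16_t b) {
    int32_t s = (int32_t)a + (int32_t)b;  /* widen to avoid overflow */
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}
```

The clamp branches are where the extra cost of non-native saturation
comes from when it has to be synthesized.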

Vec3 sub-variant types:
* __vec3f, 3x Binary32 (128-bit)
** Essentially __vec4f, but ignores W (assumed 0).
* __vec3fx, 3x Binary32 (128-bit, value-extended)
** Twiddled vector, reuses W for 10 more mantissa bits for each element.
** The X/Y/Z elements otherwise remain in Binary32 format.
** Comparably slower as this one always uses runtime calls...
* __vec3sf / __vec3h, 3x Binary16 (64-bit)
** Like __vec3f, but smaller
* __vec3fq, 3x Binary16 (64-bit, value-extended)
** Extends each Binary16 by 5 bits.
** Likewise, non-native type, exists more to cast to/from __vec3f.

Semi-Vector:
* __fcomplex / _Complex float: Complex number, Binary32
* __dcomplex / _Complex double: Complex number, Binary64
* __quatf: Quaternion, Binary32
* __quatd: Quaternion, Binary64

Vector Contents:
* Vec2: x, y
* Vec3: x, y, z
* Vec4: x, y, z, w
* Pseudo: xy, xyz, xx, yy, ...
** These allow extracting a sub-vector or shuffling elements.
* complex: r, i
* quat: i, j, k, r

Vector operators (excluding complex and quat):
* A+B: Pairwise add
* A-B: Pairwise subtract
* A*B: Pairwise multiply
* A/B: Pairwise divide
* A^B: Dot Product
* A%B: Cross Product

There is not currently a wedge-product operator.

Complex and Quaternion:
* A*B: Complex or quaternion product
* A/B: Product of A with the reciprocal of B (complex or quaternion).

Direct casting between vector types is value-preserving, so may only be
done in cases where the number of elements matches and similar.

Multiplying a vector by a scalar will scale the vector, for example:
v1=v0*(1.0/sqrt(v0^v0));
Would normalize the vector, etc...

Though, one could make a case for, say, an "_rsqrt(x)" function to
calculate 1.0/sqrt(x).
v1=v0*_rsqrt(v0^v0);
Where, say, it is faster to calculate the reciprocal square-root
directly than to calculate a square-root and then calculate the reciprocal.


Vector Literals:
* (__vec4f) { 1.0, 2.0, 3.0, 4.0 }
* Basically, similar to struct-literal creation.
* Elements given in 'x, y, z, w' order.

Note that vectors in BGBCC are pure value types, so unlike a struct, it
is not allowed to assign a vector element. You may only create a new
vector with the desired values.

If a vector construction has entirely constant values, it may be handled
as a literal, else it may be created at runtime.


Note that there is not necessarily a 1:1 mapping between the vector
operations and the underlying SIMD operations.

For operations which are not handled natively, the compiler may generate
implicit runtime calls.



There is some partial support for matrix types as well, but this is much
less used or well developed in my case (matrices currently only exist as
2x2, 3x3, and 4x4, and may be multiplied with each other or with the
matching-length vector types). I decided not to go much into matrix
math stuff.

Generally, a matrix can be seen as a vector of 2 to 4 vectors of the
corresponding size.

Matrices are generally memory backed types though, and generally
implemented internally via runtime calls.



There was a partial wrapper header interface to some of this to allow
faking a common interface on standard C.
vec4f v0, v1, v2, v3;
v0 = v4fVec4(1, 2, 3, 4);
v1 = v4fVec4(4, 5, 6, 7);
v2 = v4fAdd(v0, v1);
v3 = v4fCross(v0, v1);
...

If using a different compiler, this can allow falling back to the use of
C structs and functions; or be mapped onto the Intel "xmmintrin.h"
system or similar and performed via SSE; or use the GCC vector system if
using GCC; ...

Granted, it is more limited and less flexible.
But, can mostly work via the magic of #ifdef's...
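A sketch of what that #ifdef wrapper might look like, showing only two
backends: the GCC/Clang vector extension and a plain-struct fallback (an
SSE path would map onto "xmmintrin.h" the same way). The function names
follow the examples above but are otherwise assumed:

```c
#include <stdint.h>

#if defined(__GNUC__)
/* Backend 1: GCC/Clang native vector extension. */
typedef float vec4f __attribute__((vector_size(16)));

static vec4f v4fVec4(float x, float y, float z, float w) {
    vec4f v = { x, y, z, w };
    return v;
}
static vec4f v4fAdd(vec4f a, vec4f b) { return a + b; }  /* pairwise add */
static float v4fGetX(vec4f v) { return v[0]; }
#else
/* Backend 2: portable struct-and-functions fallback. */
typedef struct { float x, y, z, w; } vec4f;

static vec4f v4fVec4(float x, float y, float z, float w) {
    vec4f v = { x, y, z, w };
    return v;
}
static vec4f v4fAdd(vec4f a, vec4f b) {
    vec4f r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}
static float v4fGetX(vec4f v) { return v.x; }
#endif
```

Client code only sees v4fVec4/v4fAdd/etc., so it compiles the same way
against either backend.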



At present, there is no direct support for larger vectors.
Though, this was mostly because in most cases where vectors are needed,
2/3/4 element vectors are the best fit.

But, one possibility could be allowing for __vec8f and __vec16f types
and similar, and the compiler using these types natively if they exist,
or faking it otherwise (as well as adding __m256 and __m512 types and
similar).
At present these would just end up being implemented as memory-backed
types and hidden runtime calls though.


Granted, I don't really know if/how this could be efficiently mapped to
V. Handling V using a compile-time vector size and built-in defines
doesn't seem like a good approach.

...



Otherwise, starting to float ideas for how to map FP-SIMD over to my
RV64 implementation.

I will probably not go with V for now as it seems like a proper V
implementation would likely be very expensive to implement on an FPGA.

Though, the most likely "budget option" is to reuse the F registers for
FPU SIMD, using register pairs for 128-bit vectors (this maps to how it
works in my existing ISA). In this case, the logical register size
remains fixed at 64 bits (but, the compiler knows to express the 128-bit
vectors as register pairs).

Would mostly reuse existing encodings, but tweak the semantics in a few
cases. Ironically, this part is already done by my implementation based
on how I had implemented the F extension, and existing compiler output
doesn't seem to notice / care that the high bits of a Binary32 register
are not NaN-boxed.

Could also allow an implementation to detect whether or not the SIMD
extension is present based on the results of trying to add a SIMD vector
in "FADD.S": If it turns into NaN or similar, no SIMD (and if it
produces the expected output vector, it supports SIMD).

Will likely add Binary16 ops and similar though (eg, using "FADD.H" for
1/2/3/4-element Binary16 ops). The same basic encoding is already
defined (but unused) in the 'F' extension (one would just use FLD/FSD
for 64-bit vectors).

This mostly leaves things like dealing with vector shuffling and similar.


Probably tacky, and non-standard, but I want something here that isn't
going to eat the FPGA (or add too much cost over what I am doing already).


I don't expect this to be taken seriously though as an extension,
probably just something for my own uses...

Kito Cheng

Nov 7, 2024, 4:31:36 AM
to BGB, isa...@groups.riscv.org
Hi BGB:

Thanks so much for your detailed response! Let me address your points
one by one:

- Why are there so many APIs? (4000 pages!!!)
- Support for GNU vectors (`__attribute__((vector_size(16)))`)
- Reusing the FPU's F registers as vector registers

## Why Are There So Many APIs? (4000 pages!!!)

Let’s start with this one, as it’s a question a lot of people have.
The answer lies in the design of the RVV instruction set itself and
the type system in the RVV intrinsics, and it’s a bit of a long story.

RVV’s main features include:

- Using `vsetvli` to set VL, determining the length of the computation
- Using `vsetvli` to set SEW, enabling operations on different data widths
- Setting LMUL to use multiple consecutive vector registers in one operation
- Configurable mask and tail policies

So, for an instruction like vadd.vv, you can have combinations of i8,
i16, i32, and i64. Paired with LMUL settings (mf8, mf4, mf2, m1, m2,
m4, m8), just vadd.vv alone has 4 x 7 = 28 combinations, not counting
signed and unsigned variations.

For RVV, we had several options for the type system:

- Follow the x86 intrinsic type system by defining types based on
  register size, e.g., `__m64`, `__m128`
- Follow ARM's intrinsic type system by defining more abstract types,
  e.g., `svint32_t`, `svint64_t`

We chose an interface closer to ARM SVE’s more abstract type system.
However, because RVV also requires LMUL information to be resolved at
compile-time, we included LMUL in the type name, resulting in types
like `v<type><width><lmul>_t`, e.g., `vint32m1_t`. This design brings
us back to 4 x 7 = 28 combinations for vadd.vv. When considering
signed and unsigned separately, this doubles to 56. While we don’t
support every LMUL/SEW combination, there are still about 44
variations, which is already significant.

With mask on/off variations, this doubles to 88.

On top of that, while we default to ta/ma policy, we also offer tu and
mu versions to cover all possibilities, bringing the total to 264.

At this point, you might wonder: why not make it simpler by setting
everything with vsetvli as in assembly, so subsequent operations just
follow those settings?

Unfortunately, from a compiler perspective, this is difficult to
implement and would make analysis and optimization challenging. We
believe this design makes the code easier to maintain, so for users
who prefer this style, we recommend inline assembly or pure assembly.

With approximately 390 instructions (including pseudo-instructions) in
the RVV set, this design results in a large number of intrinsic APIs.

Does this mean RVV intrinsics are hard to use? Not really! The API
follows a consistent pattern, so most users can start writing simple
RVV programs after reading a few examples. The spec can be used like a
reference dictionary, and the strong typing helps catch potential
errors early on.
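For a sense of that pattern, here is a small strip-mined vector-add in
the v1.0 intrinsic style (`__riscv_` prefix, with SEW/LMUL encoded in
the names). The scalar #else branch is only there so the sketch compiles
without an RVV toolchain:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

void vec_add(int32_t *c, const int32_t *a, const int32_t *b, size_t n) {
#if defined(__riscv_v_intrinsic)
    for (size_t i = 0; i < n;) {
        /* Ask for up to (n - i) elements; hardware returns this pass's VL. */
        size_t vl = __riscv_vsetvl_e32m1(n - i);
        vint32m1_t va = __riscv_vle32_v_i32m1(a + i, vl);
        vint32m1_t vb = __riscv_vle32_v_i32m1(b + i, vl);
        vint32m1_t vc = __riscv_vadd_vv_i32m1(va, vb, vl);
        __riscv_vse32_v_i32m1(c + i, vc, vl);
        i += vl;
    }
#else
    /* Scalar fallback for targets without the V intrinsics. */
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
#endif
}
```

Note how every intrinsic takes the VL explicitly; this is what lets the
compiler reason about each operation without tracking implicit vsetvli
state.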


## Support for GNU Vectors (`__attribute__((vector_size(16)))`)

Support for GNU vectors isn’t part of the RVV intrinsic spec, but both
GCC and Clang/LLVM support them. Code generation is determined by the
presence of a vector extension. For example:

```c
#include <stdint.h>
typedef int int32x4_t __attribute__((vector_size(16)));
void foo(int32_t *a, int32_t *b, int32_t *c) {
    int32x4_t va = *(int32x4_t*)a;
    int32x4_t vb = *(int32x4_t*)b;
    int32x4_t vc = va + vb;
    *(int32x4_t*)c = vc;
}
```

With a vector extension, this can produce output like:
```asm
foo:
        vsetivli zero,4,e32,m1,ta,ma
        vle32.v  v1,0(a0)
        vle32.v  v2,0(a1)
        vadd.vv  v1,v1,v2
        vse32.v  v1,0(a2)
        ret
```

We consider this more of a language extension rather than part of
intrinsics, so for RVV Intrinsics 1.0, there’s no plan to define it
further.

## Reusing FPU’s F Registers as Vector Registers

While this is beyond the scope of intrinsics and falls under ISA spec,
it’s a fascinating topic. In AArch64, the FPU, NEON, and SVE registers
are indeed shared. This could reduce hardware costs, and RISC-V has a
similar example in the Zfinx extension, where GPRs are used for
floating-point operations instead of F registers. However, the current
RVV intrinsic spec could be compatible with this kind of register
reuse since overlapping registers wouldn’t impact intrinsics—only ABI.

This is definitely an interesting topic, so feel free to bring it up
at sig-v...@lists.riscv.org.

Thanks again for your feedback!