Skim...
So, seems sorta like a similar system to "xmmintrin.h" and friends but
specialized for 'V'. OK.
The downside of these kinds of approaches is that they are awkward to
use and notably difficult to make portable; code targeting one such
system is hard to carry over to other SIMD systems. Likewise, dealing
with variable-length vectors requires a different way of relating to
vectors than is typical in my experience.
Not that it couldn't be useful.
The reported page count for the PDF seems a bit excessive; not sure
what that is about.
Do you *really* need around 4000 pages worth of function prototypes?...
FWIW, I took a few different approaches to vectors and SIMD in my compiler:
* Approach 1:
** Leverage the existing GCC vector syntax and similar:
*** __attribute__((vector_size(16))) and similar.
*** Also allow a [vector_size(16)] shorthand.
* Approach 2:
** Glue parts of a GLSL like system onto C.
Granted, this doesn't map to variable-sized vectors; my focus and
use-cases have usually involved fixed-size vectors.
But, here is a short summary of my compiler's vector system.
Non-Type Types:
* __m64, non-typed 64-bit entity
** casts to/from this type copy raw bits without preserving value.
* __m128, non-typed 128-bit entity
* __m32, possible 32-bit variant of the above
In BGBCC, it is possible to move values between integer, floating-point,
and vector types, by casting through the above types (as opposed to
needing unions or memcpy). Cast-conversion is generally the most
preferred way of converting values in this case (can use direct
register-to-register operations).
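For example, a minimal sketch of this cast-through idiom (assuming
BGBCC's extended cast semantics as described above; not standard C):
double d = 3.14159;
__m64 bits = (__m64)d;           /* copy raw bits, no value conversion */
__vec2f v = (__vec2f)bits;       /* same 64 bits, seen as 2x Binary32 */
long long li = (long long)bits;  /* raw bits viewed as an integer */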
Vector-type types (floating point):
* __vec2f, 2x Binary32 (64-bit)
* __vec4f, 4x Binary32 (128-bit)
* __vec2d, 2x Binary64 (128-bit)
* __vec4d, 4x Binary64 (256-bit)
* __vec2sf / __vec2h, 2x Binary16 (32-bit)
* __vec4sf / __vec4h, 4x Binary16 (64-bit)
Vector-type types (int):
* __vec4w / __vec4sw, __vec4uw: 4x Int16
* __vec4i / __vec4si, __vec4ui: 4x Int32
** Base type: Unspecified sign, modulo (wrapping) arithmetic
** 'si' (likewise 'sw'): Signed, signed saturate
** 'ui' (likewise 'uw'): Unsigned, unsigned saturate
*** Saturating arithmetic does not exist natively in my SIMD ISA, so it
adds cost (see the sketch after this list).
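A minimal illustration of the intended wrap-vs-saturate semantics (the
lane values here are my own example, assuming 16-bit lanes):
__vec4w aw = (__vec4w) { 60000, 1, 2, 3 };
__vec4w bw = (__vec4w) { 10000, 1, 2, 3 };
__vec4w cw = aw + bw;   /* base type: first lane wraps mod 2^16, to 4464 */
__vec4uw au = (__vec4uw) { 60000, 1, 2, 3 };
__vec4uw bu = (__vec4uw) { 10000, 1, 2, 3 };
__vec4uw cu = au + bu;  /* 'uw': first lane saturates (clamps) to 65535 */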
Vec3 sub-variant types:
* __vec3f, 3x Binary32 (128-bit)
** Essentially __vec4f, but ignores W (assumed 0).
* __vec3fx, 3x Binary32 (128-bit, value-extended)
** Twiddled vector, reuses W for 10 more mantissa bits for each element.
** The X/Y/Z elements otherwise remain in Binary32 format.
** Comparatively slower, as this one always uses runtime calls...
* __vec3sf / __vec3h, 3x Binary16 (64-bit)
** Like __vec3f, but smaller
* __vec3fq, 3x Binary16 (64-bit, value-extended)
** Extends each Binary16 by 5 bits.
** Likewise a non-native type; exists mostly to cast to/from __vec3f.
Semi-Vector:
* __fcomplex / _Complex float: Complex number, Binary32
* __dcomplex / _Complex double: Complex number, Binary64
* __quatf: Quaternion, Binary32
* __quatd: Quaternion, Binary64
Vector Contents:
* Vec2: x, y
* Vec3: x, y, z
* Vec4: x, y, z, w
* Pseudo: xy, xyz, xx, yy, ...
** These allow extracting a sub-vector or shuffling elements (see the
sketch after this list).
* complex: r, i
* quat: i, j, k, r
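A rough sketch of how the element and pseudo members read in practice
(using the vector-literal syntax described further below):
__vec4f v = (__vec4f) { 1.0, 2.0, 3.0, 4.0 };
float x = v.x;      /* extract a single element */
__vec2f p = v.xy;   /* extract the (x, y) sub-vector */
__vec2f q = v.yy;   /* shuffle: both elements taken from y */
__vec3f t = v.xyz;  /* take (x, y, z), dropping w */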
Vector operators (excluding complex and quat):
* A+B: Pairwise add
* A-B: Pairwise subtract
* A*B: Pairwise multiply
* A/B: Pairwise divide
* A^B: Dot Product
* A%B: Cross Product
There is not currently a wedge-product operator.
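Taken together, a small sketch (assuming '^' yields a scalar, per the
normalize example further below):
__vec3f a = (__vec3f) { 1.0, 0.0, 0.0 };
__vec3f b = (__vec3f) { 0.0, 1.0, 0.0 };
__vec3f s = a + b;  /* pairwise add: (1, 1, 0) */
float d = a ^ b;    /* dot product: 0.0 */
__vec3f c = a % b;  /* cross product: (0, 0, 1) */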
Complex and Quaternion:
* A*B: Complex or quaternion product
* A/B: Product of A with the reciprocal of B.
Direct casting between vector types is value-preserving, so it may only
be done in cases where the element counts match and similar.
Multiplying a vector by a scalar will scale the vector, for example:
v1=v0*(1.0/sqrt(v0^v0));
Would normalize the vector, etc...
Though, one could make a case for, say, an "_rsqrt(x)" function to
calculate 1.0/sqrt(x).
v1=v0*_rsqrt(v0^v0);
Where, say, it is faster to compute the reciprocal square root directly
than to compute a square root and then take the reciprocal.
Vector Literals:
* (__vec4f) { 1.0, 2.0, 3.0, 4.0 }
* Basically, similar to struct-literal creation.
* Elements given in 'x, y, z, w' order.
Note that vectors in BGBCC are pure value types, so unlike with a
struct, assigning to a vector element is not allowed. You may only
create a new vector with the desired values.
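For example (this snippet is my own illustration):
__vec4f v = (__vec4f) { 1.0, 2.0, 3.0, 4.0 };
/* v.x = 5.0; */                       /* not allowed: pure value type */
v = (__vec4f) { 5.0, v.y, v.z, v.w };  /* instead, build a new vector */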
If a vector construction has entirely constant values, it may be handled
as a literal, else it may be created at runtime.
Note that there is not necessarily a 1:1 mapping between the vector
operations and the underlying SIMD operations.
For operations which are not handled natively, the compiler may generate
implicit runtime calls.
There is some partial support for matrix types as well, but this is
much less used and less well developed in my case (these currently
exist only as 2x2, 3x3, and 4x4 matrices, which may be multiplied with
each other or with the matching-length vector types). I decided not to
go far into the matrix-math stuff.
Generally, a matrix can be seen as a vector of 2 to 4 vectors of the
corresponding size.
Matrices are generally memory-backed types though, and are mostly
implemented internally via runtime calls.
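As a rough conceptual sketch (the struct layout and helper name here
are my own assumptions, not BGBCC's actual matrix types):
typedef struct { __vec4f r0, r1, r2, r3; } mat4f;  /* 4 row vectors */
__vec4f m4fMulVec(mat4f m, __vec4f v)
{
    /* matrix*vector as one dot product ('^') per row */
    return (__vec4f) { m.r0^v, m.r1^v, m.r2^v, m.r3^v };
}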
There was a partial wrapper header interface to some of this to allow
faking a common interface on standard C.
vec4f v0, v1, v2, v3;
v0 = v4fVec4(1, 2, 3, 4);
v1 = v4fVec4(4, 5, 6, 7);
v2 = v4fAdd(v0, v1);
v3 = v4fCross(v0, v1);
...
If using a different compiler, this can allow falling back to the use
of C structs and functions; or mapping onto the Intel "xmmintrin.h"
system or similar and performing the operations via SSE; or using the
GCC vector extensions if using GCC; ...
Granted, it is more limited and less flexible.
But, can mostly work via the magic of #ifdef's...
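A minimal sketch of how such a wrapper header might look (the
__BGBCC__ macro and the exact mappings here are my own assumptions):
#ifdef __BGBCC__
typedef __vec4f vec4f;
#define v4fVec4(x, y, z, w)  ((vec4f) { (x), (y), (z), (w) })
#define v4fAdd(a, b)         ((a)+(b))
#elif defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 vec4f;
#define v4fVec4(x, y, z, w)  _mm_setr_ps((x), (y), (z), (w))
#define v4fAdd(a, b)         _mm_add_ps((a), (b))
#else
typedef struct { float x, y, z, w; } vec4f;
static vec4f v4fVec4(float x, float y, float z, float w)
    { vec4f v = { x, y, z, w }; return v; }
static vec4f v4fAdd(vec4f a, vec4f b)
    { vec4f v = { a.x+b.x, a.y+b.y, a.z+b.z, a.w+b.w }; return v; }
#endif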
At present, there is no direct support for larger vectors.
Though, this is mostly because, in most cases where vectors are needed,
2/3/4 element vectors are the best fit.
But, one possibility could be allowing for __vec8f and __vec16f types
and similar, and the compiler using these types natively if they exist,
or faking it otherwise (as well as adding __m256 and __m512 types and
similar).
At present these would just end up being implemented as memory-backed
types and hidden runtime calls though.
Granted, I don't really know if or how this could be mapped efficiently
to V. Handling V using a compile-time vector size and built-in defines
doesn't seem like a good approach.
...
Otherwise, starting to float ideas for how to map FP-SIMD over to my
RV64 implementation.
I will probably not go with V for now as it seems like a proper V
implementation would likely be very expensive to implement on an FPGA.
Though, the most likely "budget option" is to reuse the F registers for
FPU SIMD, using register pairs for 128-bit vectors (this maps to how it
works in my existing ISA). In this case, the logical register size
remains fixed at 64 bits (but the compiler knows to express the 128-bit
vectors as register pairs).
Would mostly reuse existing encodings, but tweak the semantics in a few
cases. Ironically, this part is already done by my implementation based
on how I had implemented the F extension, and existing compiler output
doesn't seem to notice or care that the high bits of a Binary32 register
are not NaN-boxed (the F extension normally requires a Binary32 value in
a 64-bit register to have its high 32 bits set to all-ones).
Could also allow software to detect whether or not the SIMD extension
is present based on the result of trying to add a SIMD vector with
"FADD.S": a strict F implementation treats a non-NaN-boxed input as a
canonical NaN, so if the result turns into NaN or similar, there is no
SIMD (and if it produces the expected output vector, SIMD is supported).
Will likely add Binary16 ops and similar though (eg, using "FADD.H" for
1/2/3/4 element Binary16 ops). The same basic encoding is already
defined (but unused) in the 'F' extension (just one would use FLD/FSD
for 64-bit vectors).
This mostly leaves things like dealing with vector shuffling and similar.
Probably tacky, and non-standard, but I want something here that isn't
going to eat the FPGA (or add too much cost over what I am doing already).
I don't expect this to be taken seriously as an extension though;
probably just something for my own uses...