3D Matrix-style operations / primitives


lkcl

Sep 18, 2019, 1:24:27 AM
to RISC-V ISA Dev, Libre-RISCV General Development
does anyone know of some mathematics for analysing which would be the best "primitives" for Matrix operations, suited to transposition and inversion, determinant and normalisation?

for 3D that generally means just 2x2, 3x3 and 4x4.

i'm looking up how matrix inverses are calculated and, hoo-boy :)

normalisation looks to be just "divide by the determinant":

so... i am logically deducing that if you wanted something RISC-like, you'd have operations for transpose and determinant?  or... can determinant be broken down further into something elegant?

whilst 2x2 looks pretty obvious - (0,0 x 1,1) - (1,0 x 0,1) - it looks like it goes recursive from there, with permutations:

at that point, honestly, i'm scared/alarmed to even recommend a Matrix Determinant opcode!
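For reference, the recursion being described is cofactor (Laplace) expansion. A plain-Python sketch (purely illustrative of why a determinant opcode is alarming - done naively this is O(n!) - and not a hardware proposal):

```python
# Determinant by recursive cofactor (Laplace) expansion along row 0.
# Naive and O(n!) -- exactly why a Matrix Determinant opcode is scary.
def det(m):
    n = len(m)
    if n == 1:
        return m[0][0]
    if n == 2:
        return m[0][0] * m[1][1] - m[1][0] * m[0][1]
    total = 0
    for col in range(n):
        # minor: the matrix with row 0 and column `col` deleted
        minor = [[m[r][c] for c in range(n) if c != col] for r in range(1, n)]
        total += (-1) ** col * m[0][col] * det(minor)
    return total

print(det([[1, 2], [3, 4]]))  # -2
```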

likewise, Transpose appears to be a series of MV operations with offsets, which tends to suggest that there may be some vector primitives that would be really good to have, that could be added to this:



a quick search "matrix inverse swizzle" shows this:

which mentions _mm_shuffle_epi32, _mm_shuffle_ps, _mm_movelh_ps and that leads on a merry search to SSE/AVX/AVX-512:

looovely :)

at which point i am definitely lost.  does anyone have any suggestions?

l.


Jacob Lifshay

Sep 18, 2019, 2:28:38 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Libre-RISCV General Development
On Tue, Sep 17, 2019, 22:24 lkcl <luke.l...@gmail.com> wrote:
> does anyone know of some mathematics for analysing which would be the best "primitives" for Matrix operations, suited to transposition and inversion, determinant and normalisation?
>
> for 3D that generally means just 2x2, 3x3 and 4x4.

Generally, for 3D graphics, matrix left/right-multiplication by vector/matrix and transpose are by far more common than all the other matrix operations combined, so having special HW support for inverse, determinant, and normalize (never heard of that applied to matrices before) is probably unnecessary.


> i'm looking up how matrix inverses are calculated and, hoo-boy :)
>
> normalisation looks to be just "divide by the determinant":
>
> so... i am logically deducing that if you wanted something RISC-like, you'd have operations for transpose and determinant?  or... can determinant be broken down further into something elegant?
>
> whilst 2x2 looks pretty obvious - (0,0 x 1,1) - (1,0 x 0,1) - it looks like it goes recursive from there, with permutations:
>
> at that point, honestly, i'm scared/alarmed to even recommend a Matrix Determinant opcode!

yeah, determinant is an O(n^3) operation (though there may be something faster for really big n)

> likewise, Transpose appears to be a series of MV operations with offsets, which tends to suggest that there may be some vector primitives that would be really good to have, that could be added to this:
>
> a quick search "matrix inverse swizzle" shows this:
>
> which mentions _mm_shuffle_epi32, _mm_shuffle_ps, _mm_movelh_ps and that leads on a merry search to SSE/AVX/AVX-512:
>
> looovely :)
>
> at which point i am definitely lost.  does anyone have any suggestions?

the algorithms I've generally used are an unrolled form of gauss jordan elimination or just using the formula from a symbolic math program [1], at which point, like operations can be grouped together.

[1]:
type the following into maxima:
m:apply(matrix, makelist(makelist(concat(v, i, j), j, 0, 3), i, 0, 3)); grind(invert(m))$
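For anyone without Maxima to hand, a rough SymPy equivalent (assuming SymPy is installed; a 3x3 is used here to keep the runtime modest, where the Maxima snippet above does the full 4x4):

```python
import sympy

# Build a fully symbolic matrix v00..v22, invert it, then use
# common-subexpression elimination to group the "like operations"
# mentioned above.
m = sympy.Matrix(3, 3, lambda i, j: sympy.Symbol(f"v{i}{j}"))
inv = m.inv()                    # fully symbolic inverse
subexprs, reduced = sympy.cse(inv)
print(len(subexprs), "shared subexpressions")
```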

I've not generally had matrix inverse on a fast path, instead passing a matrix and its inverse together if needed and only calculating the inverse when I generate the matrix.

For most 3D programs, matrices are used much more than they are generated, so matrix inverse shouldn't generally be in the fast path, if the program is designed well.

Jacob

lkcl

Sep 18, 2019, 3:00:56 AM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Wednesday, September 18, 2019 at 2:28:38 PM UTC+8, Jacob Lifshay wrote:


> the algorithms I've generally used are an unrolled form of gauss jordan elimination or just using the formula from a symbolic math program [1], at which point, like operations can be grouped together.
>
>
> [1]:
> type the following into maxima:
> m:apply(matrix, makelist(makelist(concat(v, i, j), j, 0, 3), i, 0, 3)); grind(invert(m))$


https://brilliant.org/wiki/gaussian-elimination/

That looks very much like the equation substitution system I was taught at A Level. Multiply both sides by blah blah, substitute or subtract and eliminate blah blah.

Neat. Of course it applies to 3x3
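As a concrete reference, a Gauss-Jordan inverse looks like this in plain Python (a sketch with partial pivoting; for fixed n = 2..4 the loops can be fully unrolled, which is the form Jacob describes):

```python
# Gauss-Jordan matrix inverse with partial pivoting -- a plain-Python
# sketch of the algorithm discussed above, not an ISA primitive.
def gauss_jordan_inverse(a):
    n = len(a)
    # augment [A | I]
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # partial pivot: bring up the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        # eliminate this column from every other row
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

inv = gauss_jordan_inverse([[4.0, 7.0], [2.0, 6.0]])
# expected: [[0.6, -0.7], [-0.2, 0.4]]
```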

>
> I've not generally had matrix inverse on a fast path, instead passing a matrix and its inverse together if needed and only calculating the inverse when I generate the matrix.
>
>
> For most 3D programs, matrices are used much more than they are generated, so matrix inverse shouldn't generally be in the fast path, if the program is designed well.

The idea from SIGGRAPH 2019 by Pixilica, the driving force behind the proposal, is to go after long-tail applications: to design something useful for innovation in areas where mass-volume GPUs have suboptimal performance and legacy issues, precisely because of their profit-driven focus on the highest-volume markets.

Where one of these algorithms (determinant, O(n^3)) runs so long, I would be concerned about recommending it, despite its potential appeal, without some form of CSR FSM reentrant state - and even then it makes me nervous.

The compromise is to study the best algorithms in great detail, then target the ISA at parallelisable operations that will reorder the data.

A year ago I did design a REMAP system for SV which allowed up to a 3D permutation of elements.

I am however tempted to suggest just using MV.X regs[rd] = regs[regs[rs1]] as a means to arbitrarily reorder data. Precompute a transposition set of indices and let a MV take care of them.

Likewise for determinant, MV can be used to extract the elements to reorder them into permuted pairs that would allow them to be multiplied as 2 vectors, followed by the subtractions as a second set.

Likewise for Inverse. However MV.X is pretty expensive.

I wonder, then, if an answer might be to permit the rs1 vector elements to be typecast to elwidth of 8? More to the point, SV already has that capability... or does it...

No I don't think it does. The assumption is that the element indices are always XLEN wide and it is the data to which elwidth applies in the src.

Have to fix that by adding a fmt field to MV.X. Let me just update the page.

lkcl

Sep 18, 2019, 3:20:50 AM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org


On Wednesday, September 18, 2019 at 8:00:56 AM UTC+1, lkcl wrote:
 

> No I don't think it does.  The assumption is that the element indices are always XLEN wide and it is the data to which elwidth applies in the src.
>
> Have to fix that by adding a fmt field to MV.X. Let me just update the page.


The idea here is to allow 8-bit indices to be stored inside XLEN-sized registers, such that rather than doing this:

.. parsed-literal::
    ldimm x8, 1
    ldimm x9, 3
    ldimm x10, 2
    ldimm x11, 0
    {SVP.VL=4} MV.X x3, x8, elwidth=default

The alternative is this:
    
.. parsed-literal::
    ldimm x8, 0x00020301
    {SVP.VL=4} MV.X x3, x8, elwidth=8

Thus compacting four indices into the one register. x3 and x8's element
widths are *independent* of the MV.X elwidth, thus allowing the source and
destination element widths of the *elements* being moved to be over-ridden,
whilst *at the same time* allowing the *indices* to be compacted, as well.
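A behavioural sketch of the proposal in Python (illustrative only - the regfile size, the LSB-first index packing, and the read-before-write hazard handling are my assumptions, not spec):

```python
# Model of {SVP.VL=4} MV.X rd, rs1, elwidth=8: the *indices* are four
# 8-bit fields packed into the single register rs1, rather than four
# whole registers. Sources are snapshotted first, sidestepping the
# overlap hazard when rd's range includes an index's target.
def mv_x_elwidth8(regs, rd, rs1, vl):
    packed = regs[rs1]
    src = regs[:]                         # read all sources before any write
    for i in range(vl):
        idx = (packed >> (8 * i)) & 0xFF  # i-th packed 8-bit index
        regs[rd + i] = src[idx]           # regs[rd+i] = regs[index[i]]

regs = list(range(100, 132))              # toy 32-entry integer regfile
regs[8] = 0x00020301                      # indices 1, 3, 2, 0 (LSB first)
mv_x_elwidth8(regs, 3, 8, 4)              # x3..x6 <- old x1, x3, x2, x0
```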


what do you think?

l.

Jacob Lifshay

Sep 18, 2019, 3:41:49 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Libre-RISCV General Development
mv.x with 8-bit indexes sounds like a good idea.

assuming a vector of 4x4 matrixes is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using):
using standard (4+4) -> 4 swizzle instructions, with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step, to give 8 instructions in total:

input:
| m00 m10 m20 m30 |
| m01 m11 m21 m31 |
| m02 m12 m22 m32 |
| m03 m13 m23 m33 |

transpose 4 corner 2x2 matrices

intermediate:
| m00 m01 m20 m21 |
| m10 m11 m30 m31 |
| m02 m03 m22 m23 |
| m12 m13 m32 m33 |

finish transpose

output:
| m00 m01 m02 m03 |
| m10 m11 m12 m13 |
| m20 m21 m22 m23 |
| m30 m31 m32 m33 |

alternatively, if loading or storing, the transpose can be combined with the load/store instructions to give 4 instructions (strided load/stores might work, haven't checked)
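The two steps can be modelled with a generic two-source swizzle primitive; the lane selectors below are my own reconstruction from the diagrams above, not taken from any spec:

```python
# Model of the 8-instruction transpose: each step is four "(4+4) -> 4"
# two-source swizzles.
def swizzle2(a, b, sel):
    """Pick 4 lanes out of the 8 lanes of concat(a, b)."""
    lanes = a + b
    return [lanes[s] for s in sel]

# input vectors as in the 'input' diagram: m[i][k] holds element m<k><i>
m = [["m%d%d" % (r, c) for r in range(4)] for c in range(4)]

# step 1: transpose the four corner 2x2 blocks
t = [swizzle2(m[0], m[1], [0, 4, 2, 6]),
     swizzle2(m[0], m[1], [1, 5, 3, 7]),
     swizzle2(m[2], m[3], [0, 4, 2, 6]),
     swizzle2(m[2], m[3], [1, 5, 3, 7])]

# step 2: swap the off-diagonal 2x2 blocks to finish the transpose
out = [swizzle2(t[0], t[2], [0, 1, 4, 5]),
       swizzle2(t[1], t[3], [0, 1, 4, 5]),
       swizzle2(t[0], t[2], [2, 3, 6, 7]),
       swizzle2(t[1], t[3], [2, 3, 6, 7])]
```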

Jacob

lkcl

Sep 18, 2019, 4:06:19 AM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Wednesday, September 18, 2019 at 3:41:49 PM UTC+8, Jacob Lifshay wrote:

> mv.x with 8-bit indexes sounds like a good idea.

Yehyeh. I wonder...

>
> assuming a vector of 4x4 matrixes is stored as 4 separate vectors with subvl=4 in struct-of-array-of-struct form (the form I've been planning on using):
> using standard (4+4) -> 4 swizzle instructions, with 2 input vectors with subvl=4 and 1 output vector with subvl=4, a vectorized matrix transpose operation can be done in 2 steps with 4 instructions per step, to give 8 instructions in total:

How about 5?

ldimm x4, 0x0004080c # transposition indices, packed 8bit
{SVP.VL=4} MV.X x8, x4, elwidth=8
{SVP.VL=4} MV.X x9, x4, elwidth=8
{SVP.VL=4} MV.X x10, x4, elwidth=8
{SVP.VL=4} MV.X x11, x4, elwidth=8

You remember the idea we had 8 months ago to make the offsets relative to rd?

How about 2? :)

ldimm x4, 0x0004080c # transposition indices, packed 8bit
{SVP.VL=4,SUBVL=4} MV.X x8, x4, elwidth=8

Would that even work? Hm - it would work if there was a special bit or opcode to apply the offsets to SUBVL looping but not to VL.

That would be pretty awesome.

Jacob Lifshay

Sep 18, 2019, 4:16:25 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Libre-RISCV General Development
that could work, but would still take multiple clock cycles per matrix due to the sheer number of registers read. also, we'd need some serious hw optimizations to mv.x to have it be faster than the 8 cycles/matrix of the swizzle-based transpose (we'd need something like bin-packing or a register-read cache, sounds like huge area and power).

i personally think that we should just have a transpose instruction/micro-instruction that has just as much hw as needed. it could just expand to that swizzle sequence internally.

Jacob

lkcl

Sep 18, 2019, 4:34:14 AM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Wednesday, September 18, 2019 at 4:16:25 PM UTC+8, Jacob Lifshay wrote:
> On Wed, Sep 18, 2019, 01:06 lkcl <luke.l...@gmail.com> wrote:
> > How about 2? :)
> >
> >     ldimm x4, 0x0004080c # transposition indices, packed 8bit
> >     {SVP.VL=4,SUBVL=4} MV.X x8, x4, elwidth=8
> >
> > Would that even work? Hm it would work if there was a special bit or opcode to apply the offsets to SUBVL looping but not VL
>
> that could work, but would still take multiple clock cycles per matrix due to the sheer number of registers read.

One of the downsides of MV.X (and vector shuffle in general).

> also, we'd need some serious hw optimizations to mv.x to have it be faster than the 8 cycles/matrix of the swizzle-based transpose (we'd need something like bin-packing or a register-read cache, sounds like huge area and power).

It would be 2R1W per element rather than the usual 1R1W per element.

In the Dependency Matrix OoO engine it would be much more like how predication has to work:

* allocate the MVs to Function Units
* have a shadow per MV.X waiting on the rs1 read
* when the read gets the offset out the regfile pass it to the MV.X operation
* release the shadow, or raise an illegal-instruction trap if the offset was out of range of the regfile

Actually now that I think about it, SUBVL offsets need only be read once.

Also, this is so similar to LD address-index mode that it might be possible to share the same hardware.

> i personally think that we should just have a transpose instruction/micro-instruction that has just as much hw as needed. it could just expand to that swizzle sequence internally.

It's a big frickin' pipeline, and it's atomic (unless a CSR state reg is used, which then always has to be context-switched).

Will record the sequence above.


>
> Jacob

lkcl

Sep 18, 2019, 5:14:04 AM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Wednesday, September 18, 2019 at 3:41:49 PM UTC+8, Jacob Lifshay wrote:

> | m00 m10 m20 m30 |
> | m01 m11 m21 m31 |
> | m02 m12 m22 m32 |
> | m03 m13 m23 m33 |
>
>
> transpose 4 corner 2x2 matrices
>
>
>
> intermediate:
> | m00 m01 m20 m21 |
> | m10 m11 m30 m31 |
> | m02 m03 m22 m23 |
> | m12 m13 m32 m33 |

How does that work with only SUBVL swizzle, in just 4 instructions? The swizzles only apply horizontally, afaik.

L.

Jacob Lifshay

Sep 18, 2019, 5:36:17 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Libre-RISCV General Development
didn't think through it completely, since I know it can be done...

anyway, here's a web page showing one way it can be done (the swizzles will need to be decoded from the matrices provided, or just look up the sse2 instructions):

Jacob

Rogier Brussee

Sep 18, 2019, 10:29:11 AM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org


On Wednesday, September 18, 2019 at 8:28:38 AM UTC+2, Jacob Lifshay wrote:
> On Tue, Sep 17, 2019, 22:24 lkcl <luke.l...@gmail.com> wrote:
> > does anyone know of some mathematics for analysing which would be the best "primitives" for Matrix operations, suited to transposition and inversion, determinant and normalisation?
> >
> > for 3D that generally means just 2x2, 3x3 and 4x4.
>
> Generally, for 3D graphics, matrix left/right-multiplication by vector/matrix and transpose are by far more common than all the other matrix operations combined, so having special HW support for inverse, determinant, and normalize (never heard of that applied to matrices before) is probably unnecessary.


Linear algebra can be arbitrarily hard. Numerical linear algebra is very hard in practice. The canonical primitives are in the BLAS Fortran library. That said, THE primitive is the dot product between vectors.
 

> > i'm looking up how matrix inverses are calculated and, hoo-boy :)


NOOOH. To do matrix inversion (which you basically never want to do) you use the QR decomposition of the matrix, i.e. write A = QR with Q an orthogonal matrix and R upper triangular, which can be chosen with non-negative elements on the diagonal.

Then A^{-1} = R^{-1}Q^{-1} = R^{-1}Q^t.

Computing a QR decomposition is an O(n^3) operation, and doing it numerically stably is well studied but difficult.
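A NumPy sketch of that route (assuming NumPy is available; R is inverted by back-substitution, which is cheap precisely because R is triangular):

```python
import numpy as np

a = np.array([[4.0, 7.0], [2.0, 6.0]])
q, r = np.linalg.qr(a)          # A = QR: Q orthogonal, R upper triangular
n = a.shape[0]
# invert R column-by-column by back-substitution: for each unit vector
# e_j, solve R x = e_j from the bottom row upwards
r_inv = np.zeros_like(r)
for j in range(n):
    e = np.zeros(n)
    e[j] = 1.0
    for i in range(n - 1, -1, -1):
        r_inv[i, j] = (e[i] - r[i, i + 1:] @ r_inv[i + 1:, j]) / r[i, i]
a_inv = r_inv @ q.T             # A^-1 = R^-1 Q^-1 = R^-1 Q^t
```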

 
> > normalisation looks to be just "divide by the determinant":


NOOOH. She divides every column by the length of the column. 
> > so... i am logically deducing that if you wanted something RISC-like, you'd have operations for transpose and determinant?  or... can determinant be broken down further into something elegant?


NOOOH.  Determinants are very slick theoretically, but seldom useful for computations unless the matrix has special form.

> > whilst 2x2 looks pretty obvious - (0,0 x 1,1) - (1,0 x 0,1) - it looks like it goes recursive from there, with permutations:
> >
> > at that point, honestly, i'm scared/alarmed to even recommend a Matrix Determinant opcode!
>
> yeah, determinant is an O(n^3) operation (though there may be something faster for really big n)

No it really is horrible, O(n^4). You have to basically compute the QR decomposition, then det(A) = det(QR) = det(Q)det(R) = det(Q) * (product of the diagonal elements of R), with det(Q) = +/-1 - and I am not even sure how you can track the sign of the determinant while decomposing.

Of course n = 2 is really easy, and n = 3 can easily, if tediously, be expanded.
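The magnitude part is easy to check numerically (assuming NumPy): since |det(Q)| = 1, |det(A)| is just the product of the |R_ii| - it really is only the sign that is awkward:

```python
import numpy as np

a = np.array([[4.0, 7.0], [2.0, 6.0]])   # det = 4*6 - 7*2 = 10
q, r = np.linalg.qr(a)
abs_det = np.prod(np.abs(np.diag(r)))    # |det(A)| = prod |R_ii|
# the sign would have to be recovered from det(Q) = +/-1, which the
# decomposition does not hand you directly
```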

Allen Baum

Sep 18, 2019, 12:26:18 PM
to Rogier Brussee, RISC-V ISA Dev, lkcl, Libre-RISCV General Development
I'm not reading all this in excruciating detail... but isn't matrix transpose solved with vector gather?
The Livermore S1 had a "hardware" transpose op, but as I recall it was always cacheline-aligned.
This allowed them to effectively transpose arbitrarily large matrices by exchanging transposed lines, which minimized the memory traffic.
I'd have to dig out my manuals, deep, deep in the garage to be sure of the exact semantics (though I expect they're online somewhere).
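The gather point is easy to demonstrate: precompute the permutation indices once, then a single indexed read performs the whole transpose. A NumPy sketch, with fancy indexing standing in for the vector gather:

```python
import numpy as np

n = 4
flat = np.arange(n * n)                   # a 4x4 matrix, flattened row-major
# precompute once: where each output element comes from
gather_idx = np.arange(n * n).reshape(n, n).T.ravel()
transposed = flat[gather_idx]             # one gather = whole transpose
assert (transposed.reshape(n, n) == flat.reshape(n, n).T).all()
```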



lkcl

Sep 18, 2019, 3:13:46 PM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Wednesday, September 18, 2019 at 3:29:11 PM UTC+1, Rogier Brussee wrote:
 
 
> > normalisation looks to be just "divide by the determinant":
>
> NOOOH. She divides every column by the length of the column.

i can hear the screams and writhing from 10,000 miles away, rogier :)

this tends to reinforce the idea of having base primitives (vector gather etc., thank you allen), which reorder registers / memory and let it get sorted out over time by software engineering.

Atif from Pixilica proposed microcoding (or something similar to it) of actual matrix operations: your reaction, Rogier, suggests that whatever algorithm is hard-coded into microcode is going to be a risk and/or require *huge* amounts of mathematical research to get right.

l.

lkcl

Sep 18, 2019, 3:19:57 PM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org

hi stephen, something weird (again): your message didn't get through here.  the Libre RISC-V SoC extends the 32-entry regfile to 128 64-bit regs per file, with the ability to subdivide them down to the byte level in a SIMD-like fashion, but actually more like a C typecast/union of uint8_t[], uint16_t[], uint32_t[] and uint64_t[] arrays.

l.



Stephen Fuld <SF...@alumni.cmu.edu.invalid> via googlegroups.com wrote:

> and so if we were to have a (parallelised) MV.swizzle or other primitive that could span (jump) across some of those, we'd have the basics.


A few comments.

This requires more than 16 registers, as the 4x4 matrix takes 16 and 
presumably the determinant (result) takes another.

You want to do inverse, but not all matrices are invertible.  How do you
propose to handle that?

Determinant is certainly doable, but it seems even more complicated than 
the complex multiply and divide I asked about in another thread.  AFAIK, 
no CPU does even those.


> how do you go about analysing this?  has research like this been done before, into designing an ISA based around matrix multiplication (not so much revolving around SIMD, more around Vector Architectures like Cray)?


I vaguely recall some graphics oriented chip doing 4x4 matrix multiply 
to speed a common graphics transform, but I don't remember any details. 
:-(

lkcl

Sep 19, 2019, 1:16:51 PM
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Wednesday, September 18, 2019 at 10:36:17 AM UTC+1, Jacob Lifshay wrote:

> anyway, here's a web page showing one way it can be done (the swizzles will need to be decoded from the matrices provided, or just look up the sse2 instructions):

the ones being used are punpckldq and punpcklqdq, which are described here:

NP 0F 62 /r | PUNPCKLDQ mm, mm/m32 | MMX | Interleave low-order doublewords from mm and mm/m32 into mm.
 
the key being "interleave".  so out of 128 bits (4x 32-bits)... hey it's the 128-bit equivalent of PACK! :)  that's very funny.
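The 128-bit SSE2 forms (punpckldq xmm / punpcklqdq xmm) behave like this, modelled on four 32-bit lanes (note the quoted manual entry is the 64-bit MMX form):

```python
# "interleave low": take the low lanes of each source alternately.
def punpckldq(a, b):
    """Interleave the two low 32-bit lanes: [a0, b0, a1, b1]."""
    return [a[0], b[0], a[1], b[1]]

def punpcklqdq(a, b):
    """Interleave the low 64-bit halves, i.e. [a0, a1, b0, b1] in 32-bit lanes."""
    return [a[0], a[1], b[0], b[1]]

col0 = ["m00", "m10", "m20", "m30"]
col1 = ["m01", "m11", "m21", "m31"]
print(punpckldq(col0, col1))   # ['m00', 'm01', 'm10', 'm11']
```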

i'm not seeing how MV.SWIZZLE would help, given that MV.SWIZZLE only takes one source register.

l.

Jacob Lifshay

Sep 19, 2019, 3:34:05 PM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Libre-RISCV General Development
not that particular swizzle instruction, but a generalized 2-source swizzle (like LLVM's shufflevector IR instruction).

Jacob