_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> %0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
> %1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
> %2 = call <4 x 4 x float>
> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0,
> <4 x 4 x float> %1)
> store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
This sounds very interesting. Would it make sense to later expand the
idea to allow an arbitrary number of dimensions? Maybe that doesn't
make sense if we're restricted to statically-known dimensions.
How would this relate to scalable vectors? Most of the time matrix
dimensions are not known statically. Would <n x m x float> be possible?
Do you have a prototype of this?
-David
Yes, that matches my current feeling too.
> How would this relate to scalable vectors?
Scalable vectors would be a possible lowering of the matrix type. I *believe* you'd need to generate loops or at least some conditional code at run time due to the unknown scale factor.
> Most of the time matrix dimensions are not known statically. Would <n x m x float> be possible?
No, we only support statically-known dimensions.
> Do you have a prototype of this?
Yes. I can make it available if there is interest.
Adam
Hi,

We are proposing first-class type support for a new matrix type. This is a natural extension of the current vector type with an extra dimension.
For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:

%0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
%1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
%2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
Currently we support element-wise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, extract/insert of an element. Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory. We are also planning to implement vector-extract/insert and matrix-vector multiply.
All of these are currently implemented as intrinsics. Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).
These are exposed in clang via builtins. E.g. the above operation looks like this in C/C++:

typedef float mf4x4_t __attribute__((matrix_type(4, 4)));

mf4x4_t add(mf4x4_t a, mf4x4_t b) {
  return __builtin_matrix_multiply(a, b);
}
** Benefits **
Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations. But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors. Having a matrix-multiply intrinsic also allows using FMA regardless of the optimization level which is the usual sticking point with adopting FP-contraction.
Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI observable points (calls, memory operations).
The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns. This is beneficial for example because rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3 matrix).

The layout could also be made up of tiles/submatrices of the matrix. This is an important case for us to fight register pressure. Rather than loading entire matrices into registers, it lets us load only parts of the input matrix at a time in order to compute some part of the output matrix. Since this transformation reorders memory operations, we may also need to emit run-time alias checks.
Having a dedicated first-class type also allows for dedicated target-specific ABIs for matrices. This is a pretty rich area for matrices. It includes whether the matrix is stored in row-major or column-major order, whether there is padding between rows/columns, and when and how matrices are passed in argument registers. Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.
Having all this knowledge at the IR level means that front-ends are able to share the complexities of the implementation. They just map their matrix type to the IR type and the builtins to intrinsics.
At Apple, we also need to support interoperability between row-major and column-major layout. Since conversion between the two layouts is costly, they should be separate types requiring explicit instructions to convert between them. Extending the new type to include the order makes tracking the format easy and allows finding optimal conversion points.
** ABI **
We currently default to column-major order with no padding between the columns in memory. We have plans to also support row-major order and we would probably have to support padding at some point for targets where unaligned accesses are slow. In order to make the IR self-contained I am planning to make the defaults explicit in the DataLayout string.
For function parameters and return values, matrices are currently placed in memory. Moving forward, we should pass small matrices in vector registers. Treating matrices as structures of vectors seems a natural default. This works well for AArch64, since Homogeneous Short-Vector Aggregates (HVAs) can use all 8 SIMD argument registers. Thus we could pass, for example, two 4 x 4 x float matrices in registers. However on X86, we can only pass “four eightbytes”, thus limiting us to two 2 x 2 x float matrices.
Alternatively, we could treat a matrix as if its rows/columns were passed as separate vector arguments. This would allow using all 8 vector argument registers on X86 too.

Alignment of the matrix type is the same as the alignment of its first row/column vector.
** Flow **
Clang at this point mostly just forwards everything to LLVM. Then in LLVM, we have an IR function pass that lowers the matrices to target-supported vectors. As with vectors, matrices can be of any static size with any of the primitive types as the element type.

After the lowering pass, we only have matrix function arguments and instructions building up and splitting matrix values from and to vectors. CodeGen then lowers the arguments and forwards the vector values. CodeGen is already capable of further lowering vectors of any size to scalars if the target does not support vectors.

The lowering pass is also run at -O0 rather than legitimizing the matrix type during CodeGen like it’s done for structure values or invalid vectors. I don’t really see a big value of duplicating this logic across the IR and CodeGen. We just need a lighter mode in the pass at -O0.

** Roll-out and Maintenance **
Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type. ABI and intrinsic compatibility won’t be guaranteed initially until we lift the experimental status.

We are obviously interested in maintaining and improving this code in the future.

Looking forward to comments and suggestions.

Thanks,
Adam
_______________________________________________
cfe-dev mailing list
cfe...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit.
However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.)
How would you expect the frontend to lower (eg) matrix multiplication for matrices of complex numbers?
On Oct 11, 2018, at 3:42 PM, Richard Smith <ric...@metafoo.co.uk> wrote:
On Wed, 10 Oct 2018 at 23:10, Adam Nemet via cfe-dev <cfe...@lists.llvm.org> wrote:
Hi,
We are proposing first-class type support for a new matrix type. This is a natural extension of the current vector type with an extra dimension.
For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:
%0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
%1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
%2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
Currently we support element-wise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, extract/insert of an element. Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory. We are also planning to implement vector-extract/insert and matrix-vector multiply.
All of these are currently implemented as intrinsics. Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).
These are exposed in clang via builtins. E.g. the above operation looks like this in C/C++:
typedef float mf4x4_t __attribute__((matrix_type(4, 4)));
mf4x4_t add(mf4x4_t a, mf4x4_t b) {
  return __builtin_matrix_multiply(a, b);
}
This seems to me to be proposing two distinct things -- built-in matrix support in the frontend as a language extension, and support for matrices in LLVM IR -- and I think it makes sense to discuss them at least somewhat separately.
On the Clang side: our general policy for frontend language extensions is described here: http://clang.llvm.org/get_involved.html
I'm totally happy to assume that you can cover most of those points, but point 4 seems likely to be a potential sticking point. Have you talked to WG14 about adding a matrix extension to C? (I'd note that they don't even have a vector language extension yet, but as noted on that page, we should be driving the relevant standards, not diverging from them, so perhaps we should be pushing for that too). Have you talked to the people working on adding vector types to C++ about this (in particular, Matthias Kretz and Tim Shen; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4454.pdf for current / prior work in this area)?
Yes, we’re certainly interested in pushing this into the appropriate language standards. As I was saying in the proposal, right now this is experimental. We see great performance improvements with this, but we’d like to get feedback and data from the wider community and also continue to improve the implementation. Also, the surface area for this extension is pretty minimal and piggybacks on vectors. As you’ve seen in the thread, there is a diverse set of people from GPU to CPU expressing interests and preferences. My goal was to put this out there and have people collaborate and then propose what’s actually working. Obviously ABI is a big consideration here and I don’t want to design in a vacuum. Would this model work for you?
I also strongly believe that the right approach for good matrix performance is gradual lowering rather than lowering all the way to loops and then trying to recover the original intent of the program which is what the above approach takes.
On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit.
Since people already spoke up about direct HW needs, you’re probably satisfied here but as I said there are other benefits. For a chain of matrix operations it is very beneficial to fuse the operations and then have it operate on tiles of the matrices at a time. You don’t need to go to very large matrices at all to start to hit this.
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
> However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.)

That’s hard for me to answer as I didn’t look at complex-number performance.

> How would you expect the frontend to lower (eg) matrix multiplication for matrices of complex numbers?

As you say, complex numbers are not a first-class type so this is not supported, just like it’s not for vectors, i.e. you’d have to use an array of complex numbers. That said, this may tip the scales to introduce complex numbers as well.
I think there are a few potential problems with adding these into LLVM IR as a first-class type:
- With vectors, we have a simple extension from scalar operations to vector ones. A scalar operation on a vector is exactly the same as the same scalar operation applied pairwise to the elements. This is not the case for matrix types.
- A lot of analysis and transform passes need to have special cases for vectors. This is already a bit of a maintenance problem, but one that we put up with because of the big wins from vector types. Do matrix types have the same tradeoff?
- What does a matrix of pointers look like? What is a GEP on a matrix of pointers? If I do GEP on a matrix of pointers and a matrix of integers and I replace it with ptrtoint, add, inttoptr, is that equivalent?
- Does the IR representation enforce an in-memory representation? If different targets represent matrixes in different orders (e.g. column-major vs row-major) then how much do optimisers need to be aware of this?
- Can all of the more complex operations that you might want to implement on matrixes be implemented in terms of a smaller number of primitives, without resorting to long extract and insert element sequences?
- Can the same optimisations be used on sparse matrix representations that are then lowered by something custom nearer the back end?
- As others have said, how does this interact with variable-sized matrixes?
Is this something that we could get if we had better loop transformations (i.e., loop fusion)?
> However, I do wonder: how is this different from, say, complex
> numbers, for which we don't have a native IR representation? (Maybe
> the answer is that we should have native IR support for complex
> numbers too.)
We've got at least one person here with decades of experience handling
complex operations in compilers who thinks LLVM IR should absolutely
have native support for complex, FWIW. I don't have enough experience
with it myself to make a definitive statement but I can certainly see
advantages to it.
Most of us here also think that LLVM IR should absolutely have native
support for predication/masking. :)
As for matrices, I again can see the advantages but have no practical
experience to draw upon. But some people at Apple seem to think it's
advantageous and I'm interested in learning more.
-David
> Is this something that we could get if we had better loop
> transformations (i.e., loop fusion)?
>
> As we both know, aliasing usually makes this pretty difficult. For
> relatively small matrices run-time memchecks can be costly. Also we
> need to be able to evaluate these fusion opportunities in the inliner
> so having first-class type representation makes this cheap.
Fortran defines away many of these problems. Would better local
restrict support help with C-family languages? There are several old
patches from Hal that need some attention.
JF Bastien via llvm-dev <llvm...@lists.llvm.org> writes:
> Agreed these patches would be neat to revive.
We've been trying but haven't got responses to Phab comments we've
posted.
> I think we’d also want someone to pursue wg21.link/n4150.
Definitely!
> However, none of that seems like it should gate matrix support in the
> IR.
Agreed. But we should at least be aware of alternatives so we have a
good sense of the pros/cons of the proposal.
-David
> On Oct 12, 2018, at 2:02 AM, David Chisnall <David.C...@cl.cam.ac.uk> wrote:
>
> On 11 Oct 2018, at 23:42, Richard Smith via cfe-dev <cfe...@lists.llvm.org> wrote:
>>
>> On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit. However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.) How would you expect the frontend to lower (eg) matrix multiplication for matrices of complex numbers?
>
> I think there are a few potential problems with adding these into LLVM IR as a first-class type:
>
> - With vectors, we have a simple extension from scalar operations to vector ones. A scalar operation on a vector is exactly the same as the same scalar operation applied pairwise to the elements. This is not the case for matrix types.
Vectors also have things like reductions, masked load/store and scatter/gather. Matrices also have element-wise operations and then some custom ones. I don’t think the situation is all that different.
>
> - A lot of analysis and transform passes need to have special cases for vectors. This is already a bit of a maintenance problem, but one that we put up with because of the big wins from vector types. Do matrix types have the same tradeoff?
I don’t think that’s the right way to look at this proposal. As I said, I think it’s better to look at matrices as an extension to vectors adding a different view on the same underlying data. Thus most of the special cases for vectors would also apply to matrices, either by considering them SequentialTypes or via a new common base class. So effectively isVector becomes isVectorOrMatrix in many cases. This has definitely been the experience so far in the prototype.
With this extension we can cover a new class of applications that otherwise would require heroic front-end or user-level efforts. And since all these approaches perform premature lowering, enabler passes like the inliner can’t reason about the representation. Leaving them as first-class constructs makes this reasoning trivial.
>
> - What does a matrix of pointers look like? What is a GEP on a matrix of pointers? If I do GEP on a matrix of pointers and a matrix of integers and I replace it with ptrtoint, add, inttoptr, is that equivalent?
Unlike vectors, I don’t think that’s a case that we want to support unless one of the HW implementations requires this. Do you think this would be useful?
>
> - Does the IR representation enforce an in-memory representation? If different targets represent matrixes in different orders (e.g. column-major vs row-major) then how much do optimisers need to be aware of this?
Most of this should be limited to the lowering pass. That is where we would reason about the layout, fusion, register pressure and vectorization.
How much InstCombine and things like that would be affected we can decide based on the trade-off of the specific cases we want to optimize. I don’t think that’s a priority or a requirement for now.
I feel that having this extension experimental would allow us to evaluate pros and cons in this regard.
>
> - Can all of the more complex operations that you might want to implement on matrixes be implemented in terms of a smaller number of primitives, without resorting to long extract and insert element sequences?
I’d like to keep the canonical representation for these operations short so that high-level optimizations like the inliner can easily reason about them.
>
> - Can the same optimisations be used on sparse matrix representations that are then lowered by something custom nearer the back end?
I feel this is related to your scatter/gather question but I am not sure I completely get it. Can you please elaborate?
>
> - As others have said, how does this interact with variable-sized matrixes?
I’d like to experiment with lowering variable-sized matrix operations to this representation. I have some hand-wavy ideas about this. Something where we would have a lambda expressing operations on fixed-sized tiles and then a mapping function that applies/extends that to variable-sized matrices. An expression-template library may be able to gather all the required pieces for this.
Adam
Hi,We are proposing first-class type support for a new matrix type.
This is a natural extension of the current vector type with an extra dimension.
For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:%0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16%1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16%2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
Currently we support element-wise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, extract/insert of an element. Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory. We are also planning to implement vector-extract/insert and matrix-vector multiply.
All of these are currently implemented as intrinsics. Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).
These are exposed in clang via builtins. E.g. the above operations looks like this in C/C++:typedef float mf4x4_t __attribute__((matrix_type(4, 4)));mf4x4_t add(mf4x4_t a, mf4x4_t b) {return __builtin_matrix_multiply(a, b);}
** Benefits **
Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations. But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors.
Having a matrix-multiply intrinsic also allows using FMA regardless of the optimization level which is the usual sticking point with adopting FP-contraction.
Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI observable points (calls, memory operations).
The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns. This is beneficial for example because rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3 matrix).
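To make the layout idea concrete, here is a hypothetical C sketch (the names and the struct are mine, not the actual LLVM lowering) of a 3x3 float matrix stored column-major with each column padded to 4 elements, so every column starts on a 16-byte HW vector boundary:

```c
#include <assert.h>

enum { ROWS = 3, COLS = 3, PADDED_ROWS = 4 };

/* Hypothetical illustration: 3 columns, each padded from 3 to 4 floats
 * so a column maps cleanly onto a 128-bit vector register. */
typedef struct {
  float data[COLS * PADDED_ROWS];
} padded_m3x3;

static float get(const padded_m3x3 *m, int row, int col) {
  /* The pad element at the end of each column is simply skipped. */
  return m->data[col * PADDED_ROWS + row];
}

static void set(padded_m3x3 *m, int row, int col, float v) {
  m->data[col * PADDED_ROWS + row] = v;
}
```

The trade-off this illustrates: the padded form wastes one element per column but keeps each column vector-aligned.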
The layout could also be made up of tiles/submatrices of the matrix. This is an important case for us to fight register pressure. Rather than loading entire matrices into registers it lets us load only parts of the input matrix at a time in order to compute some part of the output matrix. Since this transformation reorders memory operations, we may also need to emit run-time alias checks.
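A scalar C sketch of the tiling idea (my own illustration, not the proposed lowering): compute C = A * B one 2x2 block of C at a time, so only small blocks of A and B need to be live at once rather than whole matrices.

```c
#include <assert.h>

enum { N = 4, TILE = 2 };

/* Hypothetical sketch: row-major, no padding. The caller must
 * zero-initialize C, since the tile loop accumulates into it. */
static void tiled_matmul(const float A[N * N], const float B[N * N],
                         float C[N * N]) {
  for (int i0 = 0; i0 < N; i0 += TILE)
    for (int j0 = 0; j0 < N; j0 += TILE)
      for (int k0 = 0; k0 < N; k0 += TILE)
        /* Accumulate the (i0,k0) block of A times the (k0,j0) block
         * of B into the (i0,j0) block of C. */
        for (int i = i0; i < i0 + TILE; ++i)
          for (int j = j0; j < j0 + TILE; ++j)
            for (int k = k0; k < k0 + TILE; ++k)
              C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```

Because the k0 loop revisits blocks of C, memory operations on A and B are reordered relative to a naive multiply, which is why run-time alias checks may be needed.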
Having a dedicated first-class type also allows for dedicated target-specific ABIs for matrices. This is a pretty rich area for matrices: whether the matrix is stored in row-major or column-major order, whether there is padding between rows/columns, and when and how matrices are passed in argument registers. Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.
Having all this knowledge at the IR level means that front-ends are able to share the complexities of the implementation. They just map their matrix type to the IR type and the builtins to intrinsics.
At Apple, we also need to support interoperability between row-major and column-major layout. Since conversion between the two layouts is costly, they should be separate types requiring explicit instructions to convert between them. Extending the new type to include the order makes tracking the format easy and allows finding optimal conversion points.
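The conversion cost the text refers to is essentially a transpose of the storage. A minimal sketch (function name is mine) of converting a rows x cols matrix from row-major to column-major order:

```c
#include <assert.h>

/* Hypothetical sketch of the explicit layout-conversion operation the
 * text argues should be visible in the IR: every element moves, which
 * is why conversions should be rare and placed at optimal points. */
static void row_to_col_major(const float *src, float *dst,
                             int rows, int cols) {
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c)
      dst[c * rows + r] = src[r * cols + c];
}
```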
** Roll-out and Maintenance **
Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type. ABI and intrinsic compatibility won’t be guaranteed initially until we lift the experimental status.
For those of you at the Dev Meeting interested in this topic, we’re organizing a round table at 3:30 on Thursday.
See you there.
Adam
The reason I’m opposed to a matrix *type* is that this is far too specific a concept to put into LLVM. We don’t even have signedness of integers in the type system: the instruction set is the major load-bearing part of the IR design, and the instruction set is extensible through intrinsics.
Arguing in favor of #1: AFAICT, you only need to add the new intrinsics to do matmul etc. You could just define them to take 1D vectors but apply math to them that interprets them as a 2D space. This is completely an IR level modeling issue, and would be a very non-invasive patch. You’re literally just adding a few intrinsics. All the pointwise operations and insert/extract/shuffles will “just work”. The frontend handles mapping 2d indices to 1D indices.
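The 2D-to-1D mapping the front end would perform under option #1 can be sketched in C (my illustration; the flat array stands in for a flattened vector such as <16 x float>):

```c
#include <assert.h>

/* Sketch of option #1: the front end maps 2D index (row, col) of an
 * R x C row-major matrix to the 1D index row * C + col, and the
 * multiply "intrinsic" interprets its flat operands through that
 * mapping. */
static int flat_index(int row, int col, int cols) {
  return row * cols + col;
}

/* a is m x k, b is k x n, c is m x n, all flattened row-major. */
static void matmul_flat(const float *a, const float *b, float *c,
                        int m, int n, int k) {
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int t = 0; t < k; ++t)
        acc += a[flat_index(i, t, k)] * b[flat_index(t, j, n)];
      c[flat_index(i, j, n)] = acc;
    }
}
```

Pointwise operations on the flat form indeed need no mapping at all; only shape-aware operations like the multiply need the dimensions.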
> Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI observable points (calls, memory operations).

I don’t understand this point at all.
I imagine that the information about the ABI would have been done
solely by the front-end, which includes compile-time constant padding
on all dimensions, making it feasible to add offsets when iterating
through matrix rows/cols in 1D.
If the padding is different or badly aligned, it would cost the front-end the same to lower the right offset additions at the right time as it would the middle/back-end, and we already have (all?) such decisions being made by the front-ends.
Even if we would try to simplify the IR codegen, to expose more
parallelism, the change could be an IR annotation on a new type of
scalar evolution (with arbitrary padding at arbitrary times), and not
necessarily a new type.
Someone mentioned reductions, but we already have reduction intrinsics
that deal with that:
https://llvm.org/docs/LangRef.html#experimental-vector-reduction-intrinsics
Perhaps just adding more of those would satisfy all needs?
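For reference, a reduction over one row of a flat-vector matrix with those intrinsics would look roughly like this (a sketch only; the exact experimental-intrinsic name mangling varies across LLVM versions):

```llvm
declare float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float, <4 x float>)

define float @sum_row(<4 x float> %row) {
  %s = call float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float 0.0, <4 x float> %row)
  ret float %s
}
```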
With regards to aliasing, couldn't the front-end just force noalias
when the source-language type is guaranteed not to alias, as in
Fortran? If we carry that through the optimisation passes, it should
be easy to avoid adding unnecessary runtime checks.
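In C terms, that suggestion amounts to the front end emitting noalias pointers, which `restrict` models at the source level (a sketch of mine, not a quote from the thread):

```c
#include <assert.h>

/* Sketch: if the source language guarantees the matrices don't alias
 * (as Fortran does for array arguments), marking the pointers restrict
 * lets the vectorizer proceed without runtime alias checks. */
static void add_noalias(const float *restrict a, const float *restrict b,
                        float *restrict c, int n) {
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];
}
```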
Finally, I do agree that trickling all of that matrix-specific
knowledge into generic passes like the loop vectoriser, inliner, etc.
would perhaps be too big and imperfect, but I haven't understood why a
new type would fix this, other than having to teach all passes about
the new type (which is the same cost as intrinsics).
An alternative would be a new, simpler pass, or even a simplified
VPlan, based on a quick analysis to check for such patterns, guarded
by the existence of "matrix attributes" having been set by the
front-end, for example (just guessing here).
--
cheers,
--renato
Thanks to everybody who attended the round table and everybody who commented here or discussed this with me at the Dev Meeting. As the next step, I’d like to contrast the main alternatives (flattened vector with shape info on intrinsics vs. N-dimensional vector) in more detail with pros and cons and circulate it here.
Adam
-Chris
On Oct 24, 2018, at 8:33 AM, Adam Nemet <ane...@apple.com> wrote:
> However, the argument seems to imply that a vector type like <16 x i32> can't do so. In favor of option #1, I argue that the plain <16 x i32> enables the same optimization opportunities, as long as the uses are not on ABI boundaries.

Adam and I discussed this at the devmtg, and indeed his idea is to have a “codegen prepare” sort of pass that does some amount of pre-legalization of matrices (which should also be applicable to large vectors) with the goal of reducing register pressure etc.

Adam, can you please summarize the discussions you had and what you see as the next steps here? Thanks!

I’d like to write up the main alternatives (flattened vector + shape/layout-aware intrinsics vs. N-dimensional vector) and contrast them with IR at the various stages. I am busy with some internal stuff at the moment but hoping to get to this next week.