Tile / micro-kernel dialects

Renato Golin

May 5, 2023, 8:39:29 AM
to iree-discuss
Hi,

As a follow up from the community meeting this week, I wanted to start the discussion of the two dialects we seem to be building in parallel:
  1. A tile dialect (tpp for us, vmvx in IREE)
  2. A micro-kernel dispatch dialect (xsmm for us, ukernel for IREE)
We have a more evolved design, because this is our core research, but it's also very much biased towards the library we created (libxsmm), so it's probably not representative of all the other libraries that IREE and other MLIR users need.

Our dialect documents are a little outdated, but they still give the general gist of what we're trying to do.
Note: for TPP on tensors, we'll use DPS (destination-passing style) to match other dialects.

Our vision is that we need to separate the layers into a list of canonical representations, from ingress to hardware dispatch, across multiple MLIR compilers, to focus on appropriate semantics at the right level.

We don't want to have our own list, and it doesn't help if IREE has its own either: we can't use it (or any other non-upstream project) without being stuck with a single framework, and upstreaming those separate dialects would fragment the overall MLIR design, not unify it.

We'd like to bring this to a wider audience (the rest of the MLIR community) for discussion. But first, since we've been working with IREE and our needs are very much aligned (judging by the feedback at this week's meeting), we'd like to use IREE's experience to bring to LLVM an RFC that has a higher chance of resonating with all the other groups.

A typical pipeline would be:
  1. Ingress: HLO, Torch, TOSA, ...
  2. High-Level: Linalg + NamedOps(*) [1]
  3. Pass: Block/Tile/Fuse @ tensor
  4. Tile level: { tpp, vmvx } -> Tile(*) @ tensor [2]
  5. Bufferization
  6. Tile level: { tpp, vmvx } -> Tile(*) @ memref
  7. Pass: Combine/Fuse/Reorder/Strength-Reduce @ memref
  8. Micro-kernel level: { xsmm, ukernel } -> UKernel(*)
  9. Pass: Hoisting/DCE/CSE/Canonicalization
  10. Lowering: SPIRV, LLVM, etc.
(*) Those are the places where we think there's scope for new dialects.
[1] This is probably the TCP dialect?
[2] We currently have tpp @ tensor level to make some passes simpler (not depend on address analysis for tile op fusion). This isn't mandatory in our design, but it is an important part of it.

Basically, once it gets to micro-kernel dispatch, it's really hard to do fusion, grouping, accumulation reordering, etc., so we need an intermediate state between linalg and ukernel. This is our TPP dialect: operations whose "data type" is a tile.

The size of the tile and the order in which these ops are called is up to the compiler (and support in the library).

How to bundle these calls into a macro-kernel depends on the device. On CPUs, one can use OpenMP or a smart scheduler. On GPUs, one can fuse into a single kernel and dispatch to the device.
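
As a rough illustration (the "tile.*" op names and their syntax are made up for this sketch, not the actual tpp/vmvx spelling, and MLIR details drift between versions), this is the kind of IR we mean after bufferization: the tile ops carry the semantics, and the surrounding loop nest is the macro-kernel that the compiler is free to reshape, parallelise with OpenMP, or fuse into a single GPU kernel:

  // Type of one 32x32 tile carved out of a pre-packed (blocked) buffer.
  !tile = memref<32x32xf32, strided<[32, 1], offset: ?>>

  // The loop nest below is the "macro-kernel": the compiler decides its
  // shape, whether it becomes an OpenMP parallel loop on CPU, or whether
  // it is fused into a single kernel and offloaded on GPU. The tile ops
  // carry the semantics, so no region or call target has to be inspected.
  func.func @blocked_gemm_relu(%A: memref<8x8x32x32xf32>,
                               %B: memref<8x8x32x32xf32>,
                               %C: memref<8x8x32x32xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c8 = arith.constant 8 : index
    scf.for %i = %c0 to %c8 step %c1 {
      scf.for %j = %c0 to %c8 step %c1 {
        %acc = memref.subview %C[%i, %j, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1]
            : memref<8x8x32x32xf32> to !tile
        scf.for %k = %c0 to %c8 step %c1 {
          %a = memref.subview %A[%i, %k, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1]
              : memref<8x8x32x32xf32> to !tile
          %b = memref.subview %B[%k, %j, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1]
              : memref<8x8x32x32xf32> to !tile
          // Accumulating tile matmul: C[i,j] += A[i,k] * B[k,j].
          "tile.matmul"(%a, %b, %acc) : (!tile, !tile, !tile) -> ()
        }
        // Fusable epilogue on the same tile, in place.
        "tile.relu"(%acc, %acc) : (!tile, !tile) -> ()
      }
    }
    return
  }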

Here, high-level, tile-level and ukernel-level ops can co-exist at any given time (modulo bufferization issues), and the lower-level passes simply ignore ops that are not their input ops.

This offers a lot of flexibility:
  • A GPU lowering that doesn't have support for tile micro-kernels can lower the high-level named ops directly into calls (every following pass ignores them).
  • A CPU lowering that uses hand-crafted kernels (ex. OneDNN) can do the same.
  • An accelerator device that has MLIR compiler passes can let the framework compile to tile ops, then run their passes and lower to micro-kernel calls on their own.
  • A CPU lowering that receives profile/trace/super-optimizer information can generate the loops as instructed, lowering to micro-kernel calls for each tile.
  • While at tile/tensor level, it's easier to see that inputs and outputs are the same, or unused, and make decisions about in-place vs out-of-place, kernel fusion, etc.
  • Compilers that target multiple libraries (ex. OneDNN and XSMM) can bundle/split tile calls into kernel calls and vice versa to pick the optimal macro/micro kernel call to make before it gets to function calls.
In summary, we want to find a flexible path through MLIR, using dialects at the right level, where all compilers can rely on the semantics of the ops to do their transforms, and not have to rely on scalar evolution, alias analysis, liveness analysis, etc. to find optimal lowering patterns.

We also want to allow compilers to combine with other compilers/frameworks and "talk" to each other through these strong-semantics dialects, allowing them to transform the IR without needing to "understand" some third-party dialect.

Thanks!
Renato

Mahesh Ravishankar

May 5, 2023, 2:46:33 PM
to Renato Golin, iree-discuss
Thanks Renato,

There are a lot of things this brings up that might need to be pieced apart. It might be easier to move things upstream in MLIR piece by piece instead of as one big dump. The least common denominator I see here is the ukernel lowering. In IREE there are four parts to our micro-kernel hand-off:

1) An interface for micro-kernel ops that helps define what ABI the micro-kernel lowers into when lowering to a function call: https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Codegen/Interfaces/UKernelOpInterface.td
2) Currently we have a single operation that implements this interface: https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Codegen/Interfaces/UKernelOpInterface.td
3) The pass that just lowers all operations that implement the interface into a function call.
4) The pass that matches a DAG of operations to be replaced by a single ukernel operation.

The only difference from the layering above is that the micro kernel operation operates on both tensors and memrefs and is in destination passing style. The reason for the tensor based op is that it is easier to do matches (i.e. fusion) on tensors than on memrefs. So in terms of difference with your pipeline, it would look something like this 

    1. Ingress: HLO, Torch, TOSA, ...
    2. High-Level: Linalg + NamedOps(*) [1]
    3. Pass: Block/Tile/Fuse @ tensor
    4. Tile level: { tpp, vmvx } -> Tile(*) @ tensor [2]
    5. Micro-kernel level: { xsmm, ukernel } -> UKernel(*)
    6. Bufferization
    7. Lowering: SPIRV, LLVM, etc.
    I started a working doc here with Lorenzo to align your use case (targeting libxsmm) with IREE's current use case, and to propose an RFC upstream. My current thinking is that upstream we would have to:
    a. Create a new dialect, say a "DPS" dialect, for ops that are in destination-passing style (just a thought for a place to host the op upstream; I have not thought more about the scope and naming of this dialect).
    b. Add an interface to house micro-kernel ops. I think the interface is key, because no single operation will satisfy all ABI needs, and the separation through an interface keeps it extensible (and potentially usable out of tree, based on need).
    c. Move some version of IREE's current micro_kernel op, and potentially more ops to fit your use case, upstream. As long as they all implement the interface, it provides a good separation of concerns.

    This covers 3 of the 4 aspects of what is in IREE today. The pass that matches a DAG of operations to be replaced is probably better left in user code (i.e. IREE).

    W.r.t. the other aspects of TPP, I don't have a strong opinion. For what it's worth, we were thinking of removing almost all operations from the VMVX dialect, since there really is no need for those now: they could all go through the micro-kernel operation and call different implementations. We can maybe decouple that part from the ukernel hand-off for now, because the latter seems immediately actionable.




    --
    Mahesh

    Renato Golin

    May 6, 2023, 9:42:28 AM
    to Mahesh Ravishankar, iree-discuss
    On Fri, 5 May 2023 at 19:46, Mahesh Ravishankar <ravish...@google.com> wrote:
    The only difference from the layering above is that the micro kernel operation operates on both tensors and memrefs and is in destination passing style. The reason for the tensor based op is that it is easier to do matches (i.e. fusion) on tensors than on memrefs.

    I think we're talking about different things but using the same terminology.

    You seem to be focusing on implementation-specific lowering to target specific hardware/library pairs for IREE. We're focusing on creating a generic, high-level tile instruction set that can be shared across multiple implementations for multiple architectures.

    Both involve transforms and calling micro-kernels, but I think that's where the similarities end.

    There are two distinct stages in our proposal:
    1. A semantically strong dialect in which to do high-level compiler transformations.
    2. A simple dialect to carry micro-kernel information that depends on the implementation's semantics.
    You cannot expect the micro-kernel dialect to carry generic semantics (powerful enough for compiler transformations) while at the same time being tied to the implementation's semantics (arguments, flags, return values).

    The micro-kernel dialect cannot also be used as a tile operation dialect, because it's just a function call abstraction.

    If you use this dialect for anything other than calling functions, you're giving semantics to function calls, which is going back to LLVM IR intrinsics; then what's the point of MLIR?
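
    To make the contrast concrete, a schematic MLIR fragment (all op names, the callee string and the flags attribute are made up for illustration; %a, %b, %c stand for tile-sized memrefs):

    // Tile-sized view into a packed buffer.
    !tile = memref<32x32xf32, strided<[32, 1], offset: ?>>

    // A tile op with strong semantics: fusion, accumulation reordering and
    // in-place/out-of-place decisions can be made by looking at the op alone.
    "tile.matmul"(%a, %b, %c) : (!tile, !tile, !tile) -> ()

    // A micro-kernel call abstraction: semantically it is just a call with an
    // ABI. To transform anything around it, the compiler has to decode the
    // (hypothetical) callee name and flags, i.e. give semantics to function
    // calls, which is exactly the LLVM-intrinsic situation again.
    "ukernel.call"(%a, %b, %c) {callee = "libxsmm_gemm_f32", flags = 0 : i64}
        : (!tile, !tile, !tile) -> ()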

    So in terms of difference with your pipeline, it would look something like this 
    1. Ingress: HLO, Torch, TOSA, ...
    2. High-Level: Linalg + NamedOps(*) [1]
    3. Pass: Block/Tile/Fuse @ tensor
    4. Tile level: { tpp, vmvx } -> Tile(*) @ tensor [2]
    5. Micro-kernel level: { xsmm, ukernel } -> UKernel(*)
    6. Bufferization
    This makes no sense to me. It seems you only put the uKernel dialect before bufferization because you also have it on tensors, not because it would make any sense in my example pipeline.

    After looking at the code and reading the issues on GitHub about the uKernel dialect, my impression is that where we have two dialects, you have one. You're even trying to replace vmvx with a generic micro-kernel call.

    To me, that's exactly like intrinsics in LLVM, and I see no value in it. The whole point of MLIR's staggered lowering and multiple dialects is that you can carry semantics across the passes, and with uKernel you're immediately flattening it and tying any tensor, bufferization and memref transformations to the semantics of the implementation (the library that you're calling), not just the architecture.

    The TPP abstraction, and what I assumed was the vmvx/mmt4d abstraction too, is that we can run generic passes at tensor level, independent of the implementation (but using target info as a guide) and then be able to pick the implementation at the last possible moment.

    If you're going to all this trouble to lower to MLIR only to force a back-end at the first opportunity, then why not use another PJRT back-end in the first place?
     
    I started a working doc here with Lorenzo to align your use case (targeting libxsmm) with IREE's current use case, and to propose an RFC upstream. My current thinking is that upstream we would have to:
    a. Create a new dialect, say a "DPS" dialect, for ops that are in destination-passing style (just a thought for a place to host the op upstream; I have not thought more about the scope and naming of this dialect).
    b. Add an interface to house micro-kernel ops. I think the interface is key, because no single operation will satisfy all ABI needs, and the separation through an interface keeps it extensible (and potentially usable out of tree, based on need).
    c. Move some version of IREE's current micro_kernel op, and potentially more ops to fit your use case, upstream. As long as they all implement the interface, it provides a good separation of concerns.

    Your interface doesn't have enough expressivity to cater to the needs of our micro-kernels, and it wouldn't make sense for us to raise our micro-kernel dialect to tensors.

    While designing a good dialect is an important part of our research, we cannot justify spending all our time creating a never-ending list of alternatives.

    W.r.t. the other aspects of TPP, I don't have a strong opinion. For what it's worth, we were thinking of removing almost all operations from the VMVX dialect, since there really is no need for those now: they could all go through the micro-kernel operation and call different implementations. We can maybe decouple that part from the ukernel hand-off for now, because the latter seems immediately actionable.

    As I said above, this makes no sense to me, and it seems to be a particular implementation choice in IREE that doesn't have huge value upstream.

    I can see value upstream in a micro-kernel memref dialect whose only purpose is to simplify function calling.

    I can see value upstream in multiple smaller tensor dialects carrying strong semantics, from ingress, through linalg, to tile level, but not tied to any particular implementation.

    The lowering between these two worlds is largely downstream business, on how you adapt the micro-kernel dialect to your implementations and how you use your cost models to make decisions.

    A mix of these two worlds will bring us back to LLVM IR intrinsics and I'd be strongly opposed to that move.

    Renato

    Stella Laurenzo

    May 6, 2023, 1:15:40 PM
    to Renato Golin, Mahesh Ravishankar, iree-discuss


    On Sat, May 6, 2023, 6:42 AM Renato Golin <reng...@gmail.com> wrote:
    On Fri, 5 May 2023 at 19:46, Mahesh Ravishankar <ravish...@google.com> wrote:
    The only difference from the layering above is that the micro kernel operation operates on both tensors and memrefs and is in destination passing style. The reason for the tensor based op is that it is easier to do matches (i.e. fusion) on tensors than on memrefs.

    I think we're talking about different things but using the same terminology.

    This is exactly why I suggested on Discord that we need a working group of some kind.

    IREE has imposed some additional requirements on itself that keep the "tiled intrinsic" from being just an academic recasting of LLVM IR. Much of what is implemented is about getting this capability lit up and working from top to bottom, with the goal of ensuring that it is well defined up to the input level, so that opaque tiled micro-kernels can be expressed by users if desired (this kind of upward creep, we have found, is very common in current frameworks, and we were trying to plan for it). We can talk about why we think this is important, as well as some of the other use cases we are trying to address, but I'm not sure it is super salient to this discussion.

    This is then used to implement the present pack/unpack/mmt4d lowerings. When Mahesh said that the existing dialects were going to be depopulated, it is because these cases are simple enough that they can just be lowered directly to that mechanism. Whether or not this is the right call, I have not formed an opinion. But to a first approximation, I do support pruning dialects that are not adding much value (or where there may be a better alternative being worked on by others).

    It seems to me that a substantial part of TPP is taking an alternative/complementary viewpoint and creating a well defined dialect for managing this lowering to a wider variety of primitives. The original vmvx work was positing a similar thesis, but you all have clearly put the miles on it in a way that that work never did.

    Further, TPP defining the xsmm dialect gives names and structure before bottoming out on some form of call. IREE has opinions on the form of the call due to the way it likes to handle deployment, but I expect this is largely about plumbing the exit, from your perspective.

    These things are still in a "complementary" mode in my mind, but my understanding could also be lacking at some point.

    Let's pick this up next week -- preferably by getting more time to discuss. Selfishly, I'd like to avoid a lot more back and forth over the weekend :)


    Mehdi AMINI

    May 6, 2023, 3:13:37 PM
    to Renato Golin, Mahesh Ravishankar, iree-discuss
    Hi Renato,

    I love the work your group is doing on uKernels :)

    On Sat, May 6, 2023 at 6:42 AM Renato Golin <reng...@gmail.com> wrote:
    On Fri, 5 May 2023 at 19:46, Mahesh Ravishankar <ravish...@google.com> wrote:
    The only difference from the layering above is that the micro kernel operation operates on both tensors and memrefs and is in destination passing style. The reason for the tensor based op is that it is easier to do matches (i.e. fusion) on tensors than on memrefs.

    I think we're talking about different things but using the same terminology.

    You seem to be focusing on implementation specific lowering to target specific hardware/library pairs for IREE. We're focusing on creating a generic high-level tile instruction set that can be shared across multiple implementations for multiple architectures.

    As I understand it (please correct me!), the reason for the TPP dialect to exist before bufferization is motivated by the need of doing transformations (optimizations) on this dialect, and the tensor domain is much nicer (this is similar to Linalg: it operates on tensor and memref, but tile&fuse is nicer to do on tensors).

    Something unclear to me is what can't be done using Linalg primitives already? That is, couldn't we tile linalg operations, and fuse them, bufferize, and then lower to TPP as an abstract interface to the micro kernels? What makes having TPP important to perform these optimizations? (I watched the recording from the IREE community meeting but didn't get a good feel for this, if you have other pointers!).

    I have other questions about TPP, but they may be borderline to this thread so I'll avoid derailing it!
    I'll be at the MLIR workshop and EuroLLVM next week (I understand you won't be there sadly), if someone working on the TPP is around, I'd love to discuss this more.

    Thanks,

    -- 
    Mehdi



    Renato Golin

    May 6, 2023, 4:30:50 PM
    to Mehdi AMINI, Mahesh Ravishankar, iree-discuss
    On Sat, 6 May 2023 at 20:13, Mehdi AMINI <joke...@gmail.com> wrote:
    Hi Renato,

    I love the work your group is doing on uKernels :)

    Thanks! :D

    As I understand it (please correct me!), the reason for the TPP dialect to exist before bufferization is motivated by the need of doing transformations (optimizations) on this dialect, and the tensor domain is much nicer (this is similar to Linalg: it operates on tensor and memref, but tile&fuse is nicer to do on tensors).

    Correct. We even tile in linalg, but soon after we convert to loops + tiled TPP ops, to turn the "generic" nature of linalg into something more pattern-matching friendly.

    Something unclear to me is what can't be done using Linalg primitives already? That is, couldn't we tile linalg operations, and fuse them, bufferize, and then lower to TPP as an abstract interface to the micro kernels? What makes having TPP important to perform these optimizations? (I watched the recording from the IREE community meeting but didn't get a good feel for this, if you have other pointers!).

    Basically, there's no reason why it _can't_ be done in linalg (or uKernel for that matter). 

    TPP, as a virtual ISA, is trying to encode operation semantics on tensors, and we chose to operate at tile level because of our micro-kernel "nature".

    It's more natural for pattern matchers, and more target-agnostic, to encode the operations directly (add, sub, mul, matmul, maxf) rather than as generic regions with scalar ops inside (on one side) or as generic function calls (on the other).

    There have been discussions in MLIR about a "tensor op dialect" (as opposed to arith/math) working on tensors/memrefs. We like how powerful linalg is, but the fact that it has so many extra ops (convs, gemms) shows that not everything can (or should) be represented as generics. While affine maps can represent a lot of loop structures, things get very complicated very quickly once you deviate from the beaten path.

    Another problem we found is that not all ML/HPC operations are represented equally in linalg+arith. For example, ReLU can be a `maxf(0, x)` or `sub(0, x) + sel()`. This can be efficiently implemented as a tile op in the middle of the GEMM, either as maxf or predication, or who knows how an accelerator might do that. But when you lower to linalg+arith, you have already chosen the implementation and the compiler needs to "guess what you meant", which usually goes wrong.
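
    As a concrete sketch of that guessing game, here are two of the common shapes a ReLU arrives in after lowering to linalg+arith, against the named form we would rather keep (schematic MLIR; exact op and attribute spellings drift between versions, and "tile.relu" is an illustrative name, not actual tpp syntax):

    #map = affine_map<(d0, d1) -> (d0, d1)>

    func.func @relu_shapes(%x: tensor<128x512xf32>)
        -> (tensor<128x512xf32>, tensor<128x512xf32>, tensor<128x512xf32>) {
      %init = tensor.empty() : tensor<128x512xf32>

      // Shape 1: ReLU lowered as max(x, 0).
      %relu1 = linalg.generic
          {indexing_maps = [#map, #map],
           iterator_types = [#linalg.iterator_type<parallel>,
                             #linalg.iterator_type<parallel>]}
          ins(%x : tensor<128x512xf32>) outs(%init : tensor<128x512xf32>) {
        ^bb0(%in: f32, %out: f32):
          %zero = arith.constant 0.0 : f32
          %m = arith.maxf %in, %zero : f32
          linalg.yield %m : f32
      } -> tensor<128x512xf32>

      // Shape 2: the same ReLU lowered as compare + select.
      %relu2 = linalg.generic
          {indexing_maps = [#map, #map],
           iterator_types = [#linalg.iterator_type<parallel>,
                             #linalg.iterator_type<parallel>]}
          ins(%x : tensor<128x512xf32>) outs(%init : tensor<128x512xf32>) {
        ^bb0(%in: f32, %out: f32):
          %zero = arith.constant 0.0 : f32
          %cmp = arith.cmpf ogt, %in, %zero : f32
          %s = arith.select %cmp, %in, %zero : f32
          linalg.yield %s : f32
      } -> tensor<128x512xf32>

      // What we would rather carry until the last possible moment: a named
      // tile-level op whose semantics are the operation, not one lowering.
      %relu3 = "tile.relu"(%x, %init)
          : (tensor<128x512xf32>, tensor<128x512xf32>) -> tensor<128x512xf32>

      return %relu1, %relu2, %relu3
          : tensor<128x512xf32>, tensor<128x512xf32>, tensor<128x512xf32>
    }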

    This is why we were also big proponents of the TCP dialect. We believe there is space for TCP, linalg, TPP (or arith/math on tensors/tiles), and uKernel/xsmm to all live in harmony. But we don't want to have to fudge our implementation through linalg and start the guessing game, nor do we want to lower too quickly to uKernel and end up with large pattern-matching recipes over arguments and attributes.

    I believe we can eventually ditch our xsmm dialect and use uKernel very easily (at memref level), just making sure it caters to our needs (extra return values, arg/ret semantics, dispatch+invoke). 
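
    For reference, the dispatch+invoke split at memref level looks roughly like this (the op names, attributes and handle type are made up for illustration, not the actual xsmm syntax): the dispatch resolves or JIT-compiles a kernel for a given shape/flags combination and is loop-invariant, so it can be hoisted; the invoke is the per-tile call through the returned handle.

    func.func @brgemm_tiles(%A: memref<4x32x32xf32>, %B: memref<4x32x32xf32>,
                            %C: memref<32x32xf32>) {
      %c0 = arith.constant 0 : index
      %c1 = arith.constant 1 : index
      %c4 = arith.constant 4 : index
      // Resolve the kernel once for this (m, n, k, flags) combination;
      // hoistable out of any surrounding loop nest.
      %handle = "ukernel.dispatch"() {mnk = [32, 32, 32], flags = 0 : i64}
          : () -> i64
      scf.for %k = %c0 to %c4 step %c1 {
        %a = memref.subview %A[%k, 0, 0] [1, 32, 32] [1, 1, 1]
            : memref<4x32x32xf32> to memref<32x32xf32, strided<[32, 1], offset: ?>>
        %b = memref.subview %B[%k, 0, 0] [1, 32, 32] [1, 1, 1]
            : memref<4x32x32xf32> to memref<32x32xf32, strided<[32, 1], offset: ?>>
        // Per-tile call through the dispatched handle, accumulating into %C.
        "ukernel.invoke"(%handle, %a, %b, %C)
            : (i64, memref<32x32xf32, strided<[32, 1], offset: ?>>,
               memref<32x32xf32, strided<[32, 1], offset: ?>>,
               memref<32x32xf32>) -> ()
      }
      return
    }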

    But we can't merge the TPP dialect with uKernel even if it has all we need for xsmm, nor can we (easily) go back to working on pure linalg, for the same reason: the patterns we're trying to match would require a lot of spaghetti code, while (semantically strong) named ops are trivial.

    I have other questions about TPP, but they may be borderline to this thread so I'll avoid derailing it!
    I'll be at the MLIR workshop and EuroLLVM next week (I understand you won't be there sadly), if someone working on the TPP is around, I'd love to discuss this more.

    I believe Lorenzo will be there.

    Feel free to ping me, I'd love to go through our reasoning to make sure we're proposing the best possible solution upstream.

    Thanks!
    Renato 

    Mahesh Ravishankar

    May 9, 2023, 12:21:34 AM
    to Renato Golin, iree-discuss
    On Sat, May 6, 2023 at 6:42 AM Renato Golin <reng...@gmail.com> wrote:
    On Fri, 5 May 2023 at 19:46, Mahesh Ravishankar <ravish...@google.com> wrote:
    The only difference from the layering above is that the micro kernel operation operates on both tensors and memrefs and is in destination passing style. The reason for the tensor based op is that it is easier to do matches (i.e. fusion) on tensors than on memrefs.

    I think we're talking about different things but using the same terminology.

    You seem to be focusing on implementation specific lowering to target specific hardware/library pairs for IREE. We're focusing on creating a generic high-level tile instruction set that can be shared across multiple implementations for multiple architectures.

    With the micro-kernel work that is in tree in IREE, there are two things we have focused on 
    1) Easily match a DAG of operations that can be forwarded to a micro-kernel.
    2) Managing the ABI so that all function calls do not have to go through the memref ABI.

    The first part is achieved by matching the DAG on tensors; the second by having a fixed ABI for each operation that implements the UKernelOpInterface. Being in DPS, we can go from the op on tensors to the op on buffers and then lower it to a function call. That pretty much sums up what we have in tree.
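
    Schematically, that flow is three snapshots of the same op (the op name, callee and ABI below are stand-ins for illustration, not the exact IREE op or its ABI):

    // 1) A DAG matched at tensor level is replaced by one DPS micro-kernel op.
    %r = "ukernel.generic"(%lhs, %rhs, %acc) {callee = "uk_matmul_f32"}
        : (tensor<128x256xf32>, tensor<256x64xf32>, tensor<128x64xf32>)
        -> tensor<128x64xf32>

    // 2) Being destination-passing style, bufferization just ties the result
    //    to the destination operand:
    "ukernel.generic"(%lhs_m, %rhs_m, %acc_m) {callee = "uk_matmul_f32"}
        : (memref<128x256xf32>, memref<256x64xf32>, memref<128x64xf32>) -> ()

    // 3) A pass over ops implementing the interface lowers each one into a
    //    plain call with the fixed ABI it declares (assuming a matching
    //    func.func private @uk_matmul_f32 declaration created by the pass):
    func.call @uk_matmul_f32(%lhs_m, %rhs_m, %acc_m)
        : (memref<128x256xf32>, memref<256x64xf32>, memref<128x64xf32>) -> ()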
     

    Both involve transforms and calling micro-kernels, but I think there's where the similarities end.

    There are two distinct stages in our proposal:
    1. A semantically strong dialect in which to do high-level compiler transformations.
    2. A simple dialect to carry micro-kernel information that depends on the implementation's semantics.
    You cannot expect the micro-kernel dialect to carry generic semantics (powerful enough for compiler transformations) while at the same time being tied to the implementation's semantics (arguments, flags, return values).

    To clarify, we don't have a micro-kernel dialect; it's just an interface and an operation that implements it. So the scope is pretty narrow, which is good enough for whatever our needs are.
     

    The micro-kernel dialect cannot also be used as a tile operation dialect, because it's just a function call abstraction.

    If you use this dialect for anything other than calling functions, you're giving semantics to function calls, which is going back to LLVM IR intrinsics; then what's the point of MLIR?

    So in terms of difference with your pipeline, it would look something like this 
    1. Ingress: HLO, Torch, TOSA, ...
    2. High-Level: Linalg + NamedOps(*) [1]
    3. Pass: Block/Tile/Fuse @ tensor
    4. Tile level: { tpp, vmvx } -> Tile(*) @ tensor [2]
    5. Micro-kernel level: { xsmm, ukernel } -> UKernel(*)
    6. Bufferization
    This makes no sense to me. It seems you only put the uKernel dialect before bufferization because you also have it on tensors, not because it would make any sense in my example pipeline.

    The reason why it is on tensors is that matching a DAG that corresponds to a known-good micro-kernel implementation is easier to do on tensors. In IREE, we rely heavily on tensor-based code generation; after bufferization there are very few transformations we want to do (matching is especially hard).

    --
    Mahesh

    Mehdi AMINI

    May 9, 2023, 6:14:38 AM
    to Mahesh Ravishankar, Renato Golin, iree-discuss
    When you match a DAG, what do you replace it with? You have to be able to express operations at the tensor level that match 1-1 with what the micro-kernels will be at the memref level, right? I think TPP provides this, but I don't know what you're using (any pointers to tests in the repo showing these?).

    -- 
    Mehdi
     
     


    Mahesh Ravishankar

    May 9, 2023, 10:57:16 PM
    to Mehdi AMINI, Renato Golin, iree-discuss
    We replace it with the tensor version of the ukernel op (the op implements DPS and supports both tensors and memrefs, just like Linalg does). Yes, it matches 1-1... but once matched as a "candidate for micro-kernel", there is not much more we do with it. It pretty much just bufferizes (easy, since it is DPS) and then lowers to a function call. So, in effect, when you capture a DAG as a micro-kernel op at tensor level, you get a deterministic ABI for the function call it eventually lowers to. The actual ABI depends on the micro-kernel operation it is lowered into. Today we have a single op that covers all our use cases. (We used to have two at some point, but we were able to collapse them into a single operation.)

     



    --
    Mahesh

    Renato Golin

    Jun 9, 2023, 4:02:02 PM
    to iree-discuss
    Apologies for necro-bumping this thread, but I've posted a summary on Discourse of our whole problem statement, not just the particular details of micro-kernels here.

    On Wednesday, 10 May 2023 at 03:57:16 UTC+1 Mahesh Ravishankar wrote:
    We replace it with the tensor version of the ukernel op (the op implements DPS and supports both tensors and memrefs, just like Linalg does). Yes, it matches 1-1... but once matched as a "candidate for micro-kernel", there is not much more we do with it. It pretty much just bufferizes (easy, since it is DPS) and then lowers to a function call. So, in effect, when you capture a DAG as a micro-kernel op at tensor level, you get a deterministic ABI for the function call it eventually lowers to. The actual ABI depends on the micro-kernel operation it is lowered into. Today we have a single op that covers all our use cases. (We used to have two at some point, but we were able to collapse them into a single operation.)

    The main problem with this approach for us is that we need to do a lot more than just call a micro kernel, and it's not just restricted to ABI problems either.

    We want to fuse those operations (fused-brgemm), reorder the loops (accumulation order), group operations into a larger kernel (equations), group an outer parallel loop into a macro-kernel (GPU offloading).

    We currently have the problem that linalg is not good for pattern matching: you need to inspect affine maps, iteration order, ins/outs that may not be used, and the specific chain of scalar operations inside the body, all just to match a single "add" or "matmul".

    Not to mention that the implementation inside the generic's body is defined by some higher-level lowering that may differ depending on the front-end, so you need to match a huge range of things just to "lift" it back to an op that could have been named from ingress, tiled with scf loops, etc.
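
    For a feel of what that matching involves (schematic MLIR; spellings vary across versions, and "tile.matmul" is an illustrative name): everything below has to be checked, the exact indexing maps, the iterator kinds, the mul-then-add chain, and that the accumulator is both read and written with nothing else in the region, before the generic can be called a matmul.

    // d0 = i, d1 = j, d2 = k
    #mapA = affine_map<(d0, d1, d2) -> (d0, d2)>
    #mapB = affine_map<(d0, d1, d2) -> (d2, d1)>
    #mapC = affine_map<(d0, d1, d2) -> (d0, d1)>

    // %A : tensor<128x256xf32>, %B : tensor<256x64xf32>, %C : tensor<128x64xf32>
    %res = linalg.generic
        {indexing_maps = [#mapA, #mapB, #mapC],
         iterator_types = [#linalg.iterator_type<parallel>,
                           #linalg.iterator_type<parallel>,
                           #linalg.iterator_type<reduction>]}
        ins(%A, %B : tensor<128x256xf32>, tensor<256x64xf32>)
        outs(%C : tensor<128x64xf32>) {
      ^bb0(%a: f32, %b: f32, %c: f32):
        %m = arith.mulf %a, %b : f32
        %s = arith.addf %c, %m : f32
        linalg.yield %s : f32
    } -> tensor<128x64xf32>

    // versus matching one named op, lifted once, close to ingress:
    %res2 = "tile.matmul"(%A, %B, %C)
        : (tensor<128x256xf32>, tensor<256x64xf32>, tensor<128x64xf32>)
        -> tensor<128x64xf32>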

    IIUC, IREE does not have that problem today because, once it decides to lower to micro-kernels, there's an assumption that there's nothing better left to do with them. Whether this is cost-model based or not doesn't matter much; it's about tackling the problems you can while you still can, and this hand-off assumes you no longer can.

    So in the end, we just seem to have very different problems; that's why using a generic micro-kernel library as a tile interface does not work for us.

    Hope this helps.

    Renato