--
You received this message because you are subscribed to the Google Groups "iree-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to iree-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/iree-discuss/CAF245g0p-qEL26TReA%2BOry7oR5MiEJ9XJ7QaJ5EZZ17NoaMpew%40mail.gmail.com.
Data tiling performance depends on ukernels, which have limited coverage and can’t be enabled by default.



I have the impression that we are converging towards "one size fits all", and I am not sure that's the right approach. Looking at various other frontends (e.g. Clang), it is common to fine-tune the compiler configuration (e.g. the backend configuration) for a particular target. And since every CPU (and every micro-architecture) will exhibit different characteristics, I'd assume that each would require dedicated fine-tuning (and, indeed, different lowering strategies). The numbers that Mahesh has shared are super encouraging (well done!), however:
- are the "Arm" numbers for "Arm NEON" or for "Arm SVE"?
- are these numbers with or without u-kernels (I'm mostly curious how much code-gen contributes to this)?
In the case of "Arm", I think it's important to make a distinction between NEON and SVE - these are very, _very_ different SIMD extensions. We are focusing our efforts on SVE, but I don't quite know what the status of data tiling is in that area (we have only recently enabled scalable vectorisation). In particular, I would rather avoid telling our users to disable data tiling because it doesn't work or doesn't offer the performance gains they were promised. Having said that, I appreciate that there is no SVE in the public CI, which makes us a "downstream" user at this point.
Thanks for providing more context and sharing more about your use cases! I think we all agree on the value that data tiling brings to the table, so I'm optimistic that we can move this forward if we can provide the right level of optionality and composability with the rest of the optimization strategies.

Regarding the stability and performance issues, perhaps we could open an Epic and start tracking things there? I would add issues #15061, #15132, #15027 (very useful for profiling large models but low prio). I'll check if we have other issues tracked internally. I can also revisit the performance regression with current ToT and help with a reproducer (revisiting my chat log with Mahesh, there seemed to be dispatches performing tensor copies in isolation at flow level).

I also agree with Andrzej and Dumitru on the fine-grained optionality requirements. We have to make sure data tiling integrates well with the rest of the compiler and composes with other strategies. Otherwise, with an all-or-nothing approach we would be setting the bar too high for enabling targets or specific ISAs that are left behind in the first enablement round. Again, I'm not asking for anything new here; that's what we have been doing for other optimization strategies. Implementation-wise, perhaps we could just always add the encoding and decide later in the compiler whether the transformation is materialized or not, based on target info and op information... Would that make sense?
> We are only going to be adding the encoding to operations that work well with data tiling + architectures that are supported. So it's already being enabled only where it is expected to work well AFAICS.

Thanks! Do you have a pointer to this? I thought hardware information was not available at this level in the compiler.
> We should definitely sync up on this.
+1 This keeps being delayed at our end, so apologies for not engaging more.
> When establishing such defaults we need to be following the data and selecting strategies that have good common case fan out.
Yes, and the data points at X86 and Arm NEON :)
> This just sounds like a case of people all being eager for the future where this proposal is about a step in the present.

SVE and SME are very much the "present" for us :)
Diego, Andrzej, Benoit, Ben and I had an offline discussion this morning, and came up with other ideas on how to proceed. The pain point is that we have seen the value of data tiling, but not all targets are ready for it. We want to keep enough optionality for the other targets, so people can keep contributing without passing special flags.

The prototype was to (1) query target information at the SetEncoding stage, and (2) determine whether or not we want to insert set/unset_encoding ops. Ben pointed out that we should have better isolation between setting encodings and querying whole-target information, which is a very valid point to me. The logic of SetEncoding should not be entangled with the actual target. And the big missing piece is that we can only materialize encodings for a limited set of targets. IMHO, we should complete the functionality of data tiling for the other CPU targets and IREE backends. We can have a default materialization method, which basically undoes set_/unset_encodings if the target is not ready for data tiling. This will be the default materialization pattern for the other CPU targets and IREE backends. Thankfully, we are able to early-materialize encodings if there is a single target (i.e., no heterogeneous computing). Targets that haven't implemented data tiling can go with the original graph without any special flags. The graph does not change, so it does not turn off fusion opportunities.

The new proposal is:
- Implement a default materialization pass and use it for the other CPU targets and IREE backends.
- Set encodings on matmul and batch_matmul. This will be controlled by a flag (which is on by default).

For x86 CPUs and Arm NEON CPUs, we will be able to set encodings and early-materialize them in the GlobalOptimization phase. If the control flag is off, we fall back to today's solution. For other CPU targets (like RISC-V, Arm SVE, etc.) and other IREE backends, the set/unset_encodings disappear in the early materialization phase, because they use the default materialization method. This gives us better isolation because encoding materialization logic only exists in each backend. We do not have to expose target information to SetEncoding itself.
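To make the proposed flow concrete, here is a toy Python sketch of the two phases (my own illustration with hypothetical names like `set_encoding`, `materialize_encodings`, and `SUPPORTED_TARGETS`; IREE's real passes are C++/MLIR and much richer). The point it shows: setting encodings is target-agnostic, and the default materialization simply erases the encoding, so unsupported targets recover the original graph.

```python
# Hypothetical sketch, not IREE's actual API. A "graph" is a list of op dicts.

SUPPORTED_TARGETS = {"x86_64", "arm-neon"}  # first enablement round

def set_encoding(graph, enabled=True):
    """GlobalOptimization phase: tag matmul-like ops with a symbolic encoding.

    No target information is consulted here; the encoding is just a marker.
    """
    if not enabled:
        return graph
    out = []
    for op in graph:
        if op["name"] in ("matmul", "batch_matmul"):
            out.append({**op, "encoding": "data_tiling"})
        else:
            out.append(op)
    return out

def materialize_encodings(graph, target):
    """Backend phase: each target decides what the encoding means.

    Supported targets lower encoded ops to pack -> tiled compute -> unpack.
    The default materialization just drops the encoding, so unsupported
    targets see the original graph and lose no fusion opportunities.
    """
    out = []
    for op in graph:
        if op.get("encoding") != "data_tiling":
            out.append(op)
        elif target in SUPPORTED_TARGETS:
            out.append({"name": "pack"})
            out.append({"name": op["name"] + "_tiled"})
            out.append({"name": "unpack"})
        else:
            out.append({k: v for k, v in op.items() if k != "encoding"})
    return out
```

With this shape, a RISC-V or SVE target passes through materialization unchanged, while x86/NEON get the packed form, and the target query lives entirely in the backend-side pass.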
Hanhan> Ben pointed out that we should have better isolation between setting encodings and querying whole target information
Diego> Thanks! Do you have a pointer to this? I thought hardware information was not available at this level in the compiler.

Hanhan> We can have a default materialization method, which basically can undo set_/unset_encodings if they are not ready for data-tiling.
Diego> Implementation-wise, perhaps we could just always add the encoding and decide later in the compiler if the transformation is materialized or not based on target info and op information... Would that make sense?
RE: Diego> Thanks! Do you have a pointer to this? I thought hardware information was not available at this level in the compiler.
It *shouldn't be* available, and if it is then it's tech debt. We'll push back against hardware information being used at that level in the core compiler - anything using hardware information at the frontend level is directly equivalent to the `with torch.cuda.device(1):` kind of stuff in Python, and it's really bad for the health of the compiler infra. Users will need to pass flags or inject configuration into the compiler inputs to make use of such information, as then it's a direct user choice to break core compiler functionality like multi-targeting and heterogeneous execution - just as when a user puts `with torch.cuda.device` in their Python, they are saying they don't care about those things.
Mahesh> The op-level control you are looking for is done in the `SetEncoding` pass that is run on the program, and is done in the "pre-processing" (actually the new global optimizations pass pipeline which runs before Flow passes)
Ben> anything using hardware information at the frontend level is directly equivalent to `with torch.cuda.device(1):`
Hi all,

I have a few, maybe naive, questions:
- How are the default tile sizes chosen?
- How does this take the target architecture into account?

I have a hard time reconciling what is being discussed in this thread: on one hand we say that data tiling is generally good (which I would agree with); on the other hand, it sounds like the tiling is completely hardware-independent, and this makes no sense to me. (Yes, the *technique* of data tiling should be HW-independent, but the actual tile sizes shouldn't be, IMHO.) Maybe I'm missing something, but I would expect that to choose the right tile sizes, we need to have an idea of how big the caches (Lx, shared memory, registers, etc.) are.
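For intuition on why tile sizes and cache sizes are linked, here is a deliberately simplified back-of-the-envelope model (my own illustration, not IREE's heuristic; the function name, occupancy factor, and operand count are all assumptions): pick the largest square tile such that the operand tiles fit in a fraction of a given cache level.

```python
import math

def square_tile_size(cache_bytes, elem_bytes=4, operands=3, occupancy=0.5):
    """Toy model: largest T such that `operands` T x T tiles of `elem_bytes`
    elements fit in `occupancy` of the cache. Real compilers also weigh
    register counts, vector widths, associativity, latency hiding, etc.
    """
    budget = cache_bytes * occupancy
    t = int(math.isqrt(int(budget // (operands * elem_bytes))))
    return max(t, 1)
```

For a 32 KiB L1 with f32 elements this toy model suggests tiles around 36x36; for a 256 KiB L2 it suggests roughly 104x104 - which is exactly why a tile size hardcoded without target knowledge can be badly wrong on some micro-architecture.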
Thanks for the clarifications. I thought that enabling data-tiling meant: we hardcode the tile sizes very early in the compiler pipeline :).