Hi All,
This message provides an update on our integration of TPP-MLIR into IREE's LLVM-CPU backend. TPP-MLIR is an MLIR-based lowering pipeline that leverages Tensor Processing Primitives (TPP) for optimized performance of linear algebra routines on CPUs (including Intel Xeon). We have integrated the TPP-MLIR pipeline as a dispatch lowering pipeline in IREE's LLVM-CPU backend. Our integration leverages recent changes in IREE such as executable plugins; a compiler-plugin-based implementation is work in progress (WIP).
Recently, we compared the performance of IREE's LLVM-CPU backend against our integration on a 3-layer multi-layer perceptron (MLP): MatMul + BiasAdd + ReLU. These operations were written as MLIR code in the linalg dialect, which was then fed to iree-benchmark-module for performance benchmarking. In this experiment, we compared single-core performance of this model (using --task_topology_max_group_count=1) on a 2-socket, 56-core Intel Xeon 8280 processor, codenamed CLX. A multi-core performance comparison is WIP.
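For reference, here is a minimal sketch of what one fused layer of such an MLP can look like in the linalg dialect (shapes and names here are illustrative, not our exact benchmark IR):

  #map = affine_map<(d0, d1) -> (d0, d1)>
  #map_bias = affine_map<(d0, d1) -> (d1)>

  func.func @mlp_layer(%x: tensor<256x256xf32>, %w: tensor<256x256xf32>,
                       %bias: tensor<256xf32>) -> tensor<256x256xf32> {
    // Zero-initialize the matmul accumulator.
    %zero = arith.constant 0.0 : f32
    %empty = tensor.empty() : tensor<256x256xf32>
    %acc = linalg.fill ins(%zero : f32)
           outs(%empty : tensor<256x256xf32>) -> tensor<256x256xf32>
    // MatMul.
    %mm = linalg.matmul
        ins(%x, %w : tensor<256x256xf32>, tensor<256x256xf32>)
        outs(%acc : tensor<256x256xf32>) -> tensor<256x256xf32>
    // BiasAdd + ReLU fused into one elementwise linalg.generic.
    %res = linalg.generic
        {indexing_maps = [#map, #map_bias, #map],
         iterator_types = ["parallel", "parallel"]}
        ins(%mm, %bias : tensor<256x256xf32>, tensor<256xf32>)
        outs(%empty : tensor<256x256xf32>) {
      ^bb0(%m: f32, %b: f32, %out: f32):
        %add = arith.addf %m, %b : f32
        %relu = arith.maxf %add, %zero : f32  // arith.maximumf in newer MLIR
        linalg.yield %relu : f32
    } -> tensor<256x256xf32>
    return %res : tensor<256x256xf32>
  }

Such a function is compiled with iree-compile (--iree-hal-target-backends=llvm-cpu) and then timed with iree-benchmark-module; exact flag spellings vary between IREE versions.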
The figure below shows the performance improvement offered by our integration over base IREE on the 3-layer MLP for different input shapes. Overall, we saw up to a 7x improvement over base IREE in single-core performance.
Note that our integration is not complete yet: the BiasAdd and ReLU ops in the MLP currently lower through IREE's default pipeline, and lowering them via TPP is WIP. More optimizations are also in the pipeline and are expected to deliver even better performance.
Software config: IREE at commit 92f9859168e5a7a27b5c89416919d2add850cd21, tpp-mlir at commit ff8a5fa1aba17094e8b94a293ab3ca66f9b46117.
Our integration into IREE can be found at https://github.com/plaidml/iree/tree/tpp-mlir.
Thanks,
Intel TPP-MLIR Team
Wow -- this looks great!
Where would you like to take this, and how can we help? I feel like there are a few different ways to land it, get it hooked up to the frameworks, etc. Mostly picking project/directory structure and some release engineering.
Very nice! And good to hear the compiler plugin is WIP; that might be the simplest way to get hooked up. I'd be curious to see the traces for these. I'm also interested in how this looks vs. microkernels and with the dispatch baseline enabled, as we are reworking that part and there are some bad cases generated. On a Skylake system we measured a 1024x1024x1024 matmul between 65 and 153 GFlops depending on the flags used, so there are definite gaps to address out of the box :) (and that's with a couple of known gaps on top...). It would help to identify whether other issues are present. But that's perhaps too detailed a discussion for here.
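For scale, the arithmetic behind those numbers: a 1024x1024x1024 matmul performs

  2 * M * N * K = 2 * 1024^3 ≈ 2.15 GFLOP

so 65 and 153 GFlops correspond to roughly 33 ms and 14 ms per matmul, respectively.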
On Saturday, 10 June 2023 at 01:28:57 UTC+1 Stella Laurenzo wrote:
> Wow -- this looks great!

Thanks!

> Where would you like to take this, and how can we help? I feel like there are a few different ways to land it, get it hooked up to the frameworks, etc. Mostly picking project/directory structure and some release engineering.

Our thinking is to finish the compiler plugin and see what's left to merge. We're hoping it's nothing more than a CMake flag and a new document on how to use it (build tpp-mlir, call IREE with the plugins).

For that we may need some changes in IREE if we find plugin issues that you haven't faced before, and we'll need to build our project as a shared object (which may even become the default behaviour), so that it's easy to just build and call IREE.

Hopefully we'll also be able to clean our CMake files of IREE-specific logic. If you have pointers to previous integration efforts on compiler/execution plugins, so that we know what to expect and plan for, that would be great!

Thanks!
Renato
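As a sketch of what we're hoping the end state looks like (the plugin-path variable below is an assumption about IREE's out-of-tree plugin mechanism, not a tested recipe):

  # Hypothetical: point IREE's build at tpp-mlir as an out-of-tree compiler plugin.
  cmake -G Ninja -B iree-build -S iree \
    -DIREE_CMAKE_PLUGIN_PATHS=/path/to/tpp-mlir
  cmake --build iree-build --target iree-compile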
On Saturday, 10 June 2023 at 02:26:16 UTC+1 Jacques Pienaar wrote:
> I'd be curious to see the traces for these. I'm also interested in how this looks vs. microkernels and with the dispatch baseline enabled [...]. It would help to identify whether other issues are present.

We're not too worried about the numbers. We're not even sure we called IREE with the right flags, so its low performance could very easily be down to our own ignorance.

We did look at our performance, of course, and profiled it; we identified the same problems we see with our own runner (tpp-run), which is good news. It also means IREE isn't adding onerous overhead on large enough kernels.

We did see, however, that multi-threading in IREE seems to be broken (again: possibly our ignorance, or default behaviour, ...). First, even when asking for a task topology of 56 we only get 16 threads, and those 16 threads (cores 4-19) are 100% busy during execution. Second, even at a task topology of 16, performance is only ~1.5x, not 16x. I'm guessing that's because IREE mainly uses the CPU to offload kernels to the GPU, so distributing jobs to the CPU as if they were GPU kernels loses locality and increases dispatch costs.

Our aim now is to select topology = 1 and run an OpenMP pass to get the right number of threads pinned to the right cores. We do this with our runner, so it should be fine and hopefully won't affect IREE's scheduler too much.

If that works, then it would be up to IREE to choose between accounting for OpenMP or re-implementing some of that functionality in its own scheduler. We want to use more OpenMP features later on, but we're not tied to OpenMP itself in IREE and would happily use scheduler features if they provide the same functionality.
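Concretely, the pinning we have in mind looks something like this (standard OpenMP environment variables; the exact thread count and invocation are placeholders, not our tested setup):

  # Pin one OpenMP thread per physical core, packed close together.
  export OMP_NUM_THREADS=56
  export OMP_PLACES=cores
  export OMP_PROC_BIND=close
  # Keep IREE's own task system at a single group.
  iree-benchmark-module --task_topology_max_group_count=1 ...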
Hopefully that gives you more context on where we are.

Renato