Update on TPP-MLIR integration into IREE


Niranjan Hasabnis

Jun 9, 2023, 4:50:44 PM
to iree-discuss

Hi All,


This message provides an update on our integration of TPP-MLIR into IREE's LLVM-CPU backend. TPP-MLIR is an MLIR-based lowering pipeline that leverages Tensor Processing Primitives (TPP) for optimized performance of linear algebra routines on CPUs (including Intel Xeon). We have integrated the TPP-MLIR pipeline as a dispatch lowering pipeline in IREE's LLVM-CPU backend. Our integration leverages recent IREE features such as executable plugins; a compiler-plugin-based implementation is WIP.

 

Recently, we compared the performance of IREE's LLVM-CPU backend against our integration on a 3-layer multi-layer perceptron (MLP), i.e., three repetitions of MatMul + BiasAdd + ReLU. These operations were written as MLIR code in the linalg dialect and fed to iree-benchmark-module for performance benchmarking. In this experiment, we compared single-core performance of this model (using --task_topology_max_group_count=1) on a 2-socket, 56-core Intel Xeon 8280 processor, codenamed Cascade Lake (CLX). A multi-core performance comparison is WIP.
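For reference, here is a rough sketch of the kind of invocation involved; the file name, entry function, and input shape below are placeholders rather than our exact benchmark configuration:

# Sketch only: compile the linalg-dialect MLP for the LLVM-CPU backend, then
# benchmark it on a single core. File, function, and shape names are placeholders.
iree-compile mlp_3layer.mlir \
  --iree-hal-target-backends=llvm-cpu \
  -o mlp_3layer.vmfb

iree-benchmark-module \
  --module=mlp_3layer.vmfb \
  --device=local-task \
  --task_topology_max_group_count=1 \
  --function=mlp \
  --input=256x1024xf32=1.0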

 

The figure below shows the performance improvement offered by our integration over base IREE on the 3-layer MLP for different input shapes. Overall, we saw up to a 7x improvement over base IREE in single-core performance.

 

[Figure: 3layer_mlp_iree_tpp_gflops.jpg -- single-core GFLOPS of base IREE vs. our TPP-MLIR integration on the 3-layer MLP across input shapes]

 

Note that our integration is not complete yet: the BiasAdd and ReLU ops in the MLP currently lower through IREE, and lowering them via TPP is WIP. More optimizations are also in the pipeline and are expected to deliver even better performance.

 

Software config: IREE at commit 92f9859168e5a7a27b5c89416919d2add850cd21, tpp-mlir at commit ff8a5fa1aba17094e8b94a293ab3ca66f9b46117.

 

Our integration into IREE can be found at https://github.com/plaidml/iree/tree/tpp-mlir.

 

Thanks,

Intel TPP-MLIR Team

Stella Laurenzo

Jun 9, 2023, 8:28:57 PM
to Niranjan Hasabnis, iree-discuss
Wow -- this looks great!

Where would you like to take this, and how can we help? I feel like there are a few different ways to land it, get it hooked up to the frameworks, etc. Mostly picking project/directory structure and some release engineering.

Thank you for sharing!


Jacques Pienaar

Jun 9, 2023, 9:26:16 PM
to Stella Laurenzo, Niranjan Hasabnis, iree-discuss
Very nice! And good to hear the compiler plugin is WIP; that might be the simplest way to get hooked up.

I'd be curious to see the traces for these. I'm also interested in how this looks vs. microkernels and with the dispatch baseline enabled, as we are reworking that part and there are some bad cases generated. On a Skylake system we measured a 1024x1024x1024 matmul at between 65 and 153 GFLOPS depending on the flags used, so there are definite gaps to address out of the box :) (and that's with a couple of known gaps that would push it higher ...). It would help to identify whether other issues are present. But that's perhaps too detailed a discussion for here.

Best,

Jacques


Renato Golin

Jun 11, 2023, 4:39:20 PM
to iree-discuss
On Saturday, 10 June 2023 at 01:28:57 UTC+1 Stella Laurenzo wrote:
Wow -- this looks great!

Thanks!

Where would you like to take this, and how can we help? I feel like there are a few different ways to land it, get it hooked up to the frameworks, etc. Mostly picking project/directory structure and some release engineering.

Our thinking is to finish the compiler plugin and see what's left to merge. We're hoping it's nothing more than a CMake flag and a new document on how to use it (build tpp-mlir, call IREE with the plugins).

For that we may need some changes in IREE if we find plugin issues that you haven't faced before, and we'll need to build our project as a shared object (which may even become the default behaviour), so that it's easy to just build and call IREE.

Hopefully we'll also be able to clean the IREE-specific logic out of our CMake files.
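To illustrate what we have in mind, here's a hypothetical sketch using IREE's existing IREE_COMPILER_PLUGIN_PATHS CMake option; the plugin path is a placeholder and the final integration may well look different:

# Hypothetical sketch: point IREE's build at an out-of-tree compiler plugin.
# The plugin directory below is a placeholder, not the actual tpp-mlir layout.
cmake -GNinja -B ../iree-build -S . \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DIREE_COMPILER_PLUGIN_PATHS=/path/to/tpp-mlir/iree-plugins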

If you have pointers to previous integration efforts with compiler/executable plugins, so that we know what to expect and plan for, that would be great!

Thanks!
Renato

Renato Golin

Jun 11, 2023, 5:06:29 PM
to iree-discuss
On Saturday, 10 June 2023 at 02:26:16 UTC+1 Jacques Pienaar wrote:
Very nice! And good to hear the compiler plugin is WIP; that might be the simplest way to get hooked up.

I'd be curious to see the traces for these. I'm also interested in how this looks vs. microkernels and with the dispatch baseline enabled, as we are reworking that part and there are some bad cases generated. On a Skylake system we measured a 1024x1024x1024 matmul at between 65 and 153 GFLOPS depending on the flags used, so there are definite gaps to address out of the box :) (and that's with a couple of known gaps that would push it higher ...). It would help to identify whether other issues are present. But that's perhaps too detailed a discussion for here.

We're not too worried about the numbers. We're not even sure we called it with the right flags anyway, so IREE's low performance could very easily have been our own ignorance.

We did look at our performance, of course, and profiled it, and we identified the same problems we have with our own runner (tpp-run), which is good news. It also means IREE isn't adding onerous overhead on large enough kernels.

We did see, however, that multi-threading in IREE seems to be broken (or, again, it could be our ignorance, or default behaviour, ...).

The first thing is that, even when asking for a task topology of 56, we only get 16 threads, and those 16 threads (cores 4-19) are 100% busy during execution.

The second thing is that even at a task topology of 16 threads, performance is ~1.5x, not 16x. I'm guessing that's because IREE mainly uses the CPU to offload kernels to the GPU, so distributing jobs to the CPU as if they were GPU kernels loses locality and increases dispatch costs.

Our aim now is to select topology = 1 and run an OpenMP pass to get the right number of threads pinned to the right cores. We do this with our runner, so it should be fine and hopefully won't affect IREE's scheduler too much.
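Roughly the kind of pinning we mean, expressed here with standard OpenMP environment variables; this is purely illustrative and not the exact mechanism in tpp-run, and the module/function names are placeholders:

# Illustrative only: standard OpenMP pinning controls, combined with a single
# IREE task group so the two schedulers don't compete for threads.
export OMP_NUM_THREADS=28   # e.g. one socket's worth of cores on this machine
export OMP_PROC_BIND=close
export OMP_PLACES=cores

iree-benchmark-module --module=mlp_3layer.vmfb --device=local-task \
  --task_topology_max_group_count=1 --function=mlp --input=256x1024xf32=1.0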

If that works, then it'd be up to IREE to choose between accounting for OpenMP or re-implementing some of that functionality in its own scheduler. We want to use more OpenMP features later on, but we're not wedded to OpenMP itself in IREE and would happily use scheduler features if they provide the same functionality.

Renato

Stella Laurenzo

Jun 11, 2023, 5:16:07 PM
to Renato Golin, iree-discuss
On Sun, Jun 11, 2023 at 1:39 PM Renato Golin <reng...@gmail.com> wrote:
On Saturday, 10 June 2023 at 01:28:57 UTC+1 Stella Laurenzo wrote:
Wow -- this looks great!

Thanks!

Where would you like to take this, and how can we help? I feel like there are a few different ways to land it, get it hooked up to the frameworks, etc. Mostly picking project/directory structure and some release engineering.

Our thinking is to finish the compiler plugin and see what's left to merge. We're hoping it's nothing more than a CMake flag and a new document on how to use it (build tpp-mlir, call IREE with the plugins).

For that we may need some changes in IREE if we find plugin issues that you haven't faced before, and we'll need to build our project as a shared object (which may even become the default behaviour), so that it's easy to just build and call IREE.

Yeah, we've just been extending the plugin extension points in the compiler on an as-needed basis to make sure that the default pipelines have enough "tap points". Happy to scope and add more.

We have support for runtime shared objects right now. For the moment, we're still building the full libIREECompiler.so statically with the necessary bundled plugins, and leaving the choice of which libIREECompiler.so up to configuration options in the deployment packages. I have a few things in mind for how to take this further. This is all new build stuff and probably easier to work through in a high-bandwidth/f2f conversation -- all things are possible, but I'd like to prioritize the shortest path that realizes the value.

Some existing references:

* Aside from the samples in-tree, the nvgpu project continues to be what is pushing compiler and runtime plugins the furthest.
* Python package for the OpenXLA PJRT CPU plugin (which would need some Python goo to detect/configure things). 
* You'll see that the basic pattern is to dynamically locate the needed shared libraries and pass those as configuration keys to the PJRT plugin, registering it with Jax (similar for PyTorch, TF, et al). Right now, it dynamically locates only libIREECompiler.so, but this should be extended to also dynamically configure loading of your runtime shared library. It should also do some machine probing to determine whether it should add config keys to enable your plugin. If we allowed second-level compiler shared libraries (e.g. libIREEPluginTPP.so) in addition to the mondo build of libIREECompiler.so, then this would be probed for here, version-verified, and configured on the PJRT plugin to load the shared library along with the compiler (and activate the plugin).
* The PJRT plugin infra also supports dynamically loading a partitioner/pre-processor for distribution, but I assume this would not be used here to start.
* Someone (probably me I guess) will need to bite the bullet and just make the compiler able to load plugins as shared libraries. I know how to do it, and it is hard to do right. But meh. Just been kicking the can down the road so that I can do it in light of a non-contrived use case.

I've been trying to steer this towards binary builds of components that compose vs ever increasing source-level project coupling, because that is impossible to scale for an ecosystem like this that encourages duplication and platform-specific optimizations and bets. It also reduces the "Google tax" whereby any infrastructure that we use internally has to comply with very strict build layering and maintenance burdens: basically, if Google seeks to use pieces of OpenXLA internally, then it will pay the cost to bring the satellite projects into compliance vs forcing it on everyone at inception. Practically, this means you get to ignore Bazel. Hopefully, this also reduces the "human overhead" of ever increasing consensus groups.

Happy to get some face time and make plans. I think that ~everything you want to do can be done more or less today with a couple of light source level changes and some stringing of environment variables. If we could get that proven, then it becomes easier to make it good (the "making it good" part is what I've been iterating on in the context of the PJRT plugin, releasing, etc).


Hopefully we'll also be able to clean the IREE-specific logic out of our CMake files.

If you have pointers to previous integration efforts with compiler/executable plugins, so that we know what to expect and plan for, that would be great!

Thanks!
Renato


Stella Laurenzo

Jun 11, 2023, 5:18:15 PM
to Renato Golin, iree-discuss
On Sun, Jun 11, 2023 at 2:06 PM Renato Golin <reng...@gmail.com> wrote:
On Saturday, 10 June 2023 at 02:26:16 UTC+1 Jacques Pienaar wrote:
Very nice! And good to hear the compiler plugin is WIP; that might be the simplest way to get hooked up.

I'd be curious to see the traces for these. I'm also interested in how this looks vs. microkernels and with the dispatch baseline enabled, as we are reworking that part and there are some bad cases generated. On a Skylake system we measured a 1024x1024x1024 matmul at between 65 and 153 GFLOPS depending on the flags used, so there are definite gaps to address out of the box :) (and that's with a couple of known gaps that would push it higher ...). It would help to identify whether other issues are present. But that's perhaps too detailed a discussion for here.

We're not too worried about the numbers. We're not even sure we called it with the right flags anyway, so IREE's low performance could very easily have been our own ignorance.

We did look at our performance, of course, and profiled it, and we identified the same problems we have with our own runner (tpp-run), which is good news. It also means IREE isn't adding onerous overhead on large enough kernels.

We did see, however, that multi-threading in IREE seems to be broken (or, again, it could be our ignorance, or default behaviour, ...).

The first thing is that, even when asking for a task topology of 56, we only get 16 threads, and those 16 threads (cores 4-19) are 100% busy during execution.

The second thing is that even at a task topology of 16 threads, performance is ~1.5x, not 16x. I'm guessing that's because IREE mainly uses the CPU to offload kernels to the GPU, so distributing jobs to the CPU as if they were GPU kernels loses locality and increases dispatch costs.

Our aim now is to select topology = 1 and run an OpenMP pass to get the right number of threads pinned to the right cores. We do this with our runner, so it should be fine and hopefully won't affect IREE's scheduler too much.


As Mahesh mentioned in the #announcements thread, it's not your job to make sure you are holding that stuff right. I suspect it would be a good exercise to converge it some more, but it certainly doesn't need to be blocking.
 
If that works, then it'd be up to IREE to choose between accounting for OpenMP or re-implementing some of that functionality in its own scheduler. We want to use more OpenMP features later on, but we're not wedded to OpenMP itself in IREE and would happily use scheduler features if they provide the same functionality.

Seems like we can work that out...
 

Renato


Diego Caballero

Jun 12, 2023, 1:24:10 AM
to Stella Laurenzo, Renato Golin, iree-discuss
Hi there,

Thanks for sharing! Really excited to see the progress. Awesome numbers, indeed. As Jacques mentioned, for that shape and data type, IREE's numbers should be somewhere between 56 and 153 GFLOPS out of the theoretical ~236 GFLOPS with one thread on our machine. A few things that may help (a combined invocation is sketched after the list):
  • Make sure you are compiling with `--iree-llvmcpu-target-cpu=cascadelake`
  • If running with one thread, you can run with `--device=local-task` to reduce some scheduling overhead.
  • If you want to try microkernels, you can compile with `--iree-flow-enable-data-tiling --iree-llvmcpu-enable-microkernels`    
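Putting those together, an illustrative invocation might look like the following; the file, function, and input names are placeholders:

# Illustrative combination of the flags above; names and shapes are placeholders.
iree-compile mlp_3layer.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=cascadelake \
  --iree-flow-enable-data-tiling \
  --iree-llvmcpu-enable-microkernels \
  -o mlp_3layer_clx.vmfb

iree-benchmark-module --module=mlp_3layer_clx.vmfb --device=local-task \
  --task_topology_max_group_count=1 --function=mlp --input=256x1024xf32=1.0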
Multi-threading issues are known and we have some plans to improve that as part of a larger effort to improve the way we choose vectorization and unrolling factors for the dispatches. The current thread distribution approach hasn't been widely tested beyond 8 threads so I'm not surprised things don't scale beyond that. We also distribute multiple dimensions, which is not ideal on CPUs, as you well mentioned.

We also have some ongoing work on introducing multi-level tiling to improve cache locality on matmuls, and we are evaluating different approaches to deal with matmul sizes that are not a multiple of the vectorization factor, but it may take some time for all of this to land.

Hopefully that gives you more context on where we are.

Thanks,
Diego


Renato Golin

Jun 12, 2023, 5:49:42 AM
to Diego Caballero, Stella Laurenzo, iree-discuss
On Mon, 12 Jun 2023 at 06:24, Diego Caballero <diegoca...@google.com> wrote:
Hopefully that gives you more context on where we are.

That's perfect, thank you! We'll use all of those in our next round. I'm hoping our performance numbers will be on par with each other at that point (both with problems to fix :).

Now, on to delving into the compiler plugin problem; hopefully we'll have a prototype soon to iterate on the final steps of the integration.

Thanks!
Renato