RFC: Turn the data-tiling path on by default for CPU backends


Hanhan Wang

Oct 17, 2023, 6:52:54 PM10/17/23
to iree-d...@googlegroups.com, Benoit Jacob, Stella Laurenzo, Mahesh Ravishankar
Hi folks,

We've been prototyping data tiling for a long time, and we are now able to get decent performance on x86 and Arm CPUs. It works great with the models we've been tracking in OSS, in terms of both compilation and performance. For some models we see up to 2x improvements over the IREE CPU default. +Mahesh Ravishankar gave a talk at the MLIR workshop; he can fill in more comparison numbers between data tiling, the IREE CPU default, and XLA:CPU, if people are interested.
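For readers less familiar with the technique, here is a minimal sketch of the relayout data tiling performs on a matmul operand. This is plain Python with made-up shapes and tile sizes, not IREE's implementation; the compiler picks the inner tile per target architecture.

```python
# Illustrative only: data tiling relayouts a row-major M x K matrix into
# (M/m0) x (K/k0) tiles of shape m0 x k0, so the inner kernel can read
# each tile contiguously. Tile sizes here are invented for the example.

def pack(a, m0, k0):
    """Pack a row-major M x K matrix into m0 x k0 tiles."""
    m, k = len(a), len(a[0])
    assert m % m0 == 0 and k % k0 == 0  # the real pipeline pads instead
    return [[[[a[bi * m0 + i][bj * k0 + j] for j in range(k0)]
              for i in range(m0)]
             for bj in range(k // k0)]
            for bi in range(m // m0)]

A = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 matrix
packed = pack(A, 2, 2)
print(packed[0][1])  # the top-right 2x2 tile: [[2, 3], [6, 7]]
```

The inverse relayout (unpack) runs on the result; the cost of these extra traversals is what the pack/unpack fusion and constant-folding work in this thread is about hiding.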

I have a draft PR that enables data tiling by default, and it works well with all the unit tests and benchmark suites. We plan to turn the data-tiling path on by default for the x86 and Arm backends. The flags to switch data tiling on/off will remain available, and both paths will stay tested on CI. Please let me know what you think!

Also, I'm looking to gather more information, about crashes first and performance regressions later. Feel free to cc me on issues and PRs related to data tiling.

Thanks,
- Hanhan

Stella Laurenzo

Oct 17, 2023, 7:00:10 PM10/17/23
to Hanhan Wang, iree-discuss, Benoit Jacob, Stella Laurenzo, Mahesh Ravishankar
Thank you! +1 on flipping the switch as soon as crashes and stability issues are resolved, given the perf improvements to modern workloads of interest. We need to keep things moving and concentrate on performant, well-invested paths.

--
You received this message because you are subscribed to the Google Groups "iree-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to iree-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/iree-discuss/CAF245g0p-qEL26TReA%2BOry7oR5MiEJ9XJ7QaJ5EZZ17NoaMpew%40mail.gmail.com.

Diego Caballero

Oct 17, 2023, 8:57:01 PM10/17/23
to Stella Laurenzo, Hanhan Wang, iree-discuss, Benoit Jacob, Stella Laurenzo, Mahesh Ravishankar
Thanks for the RFC! As I mentioned in the PR, I’m definitely supportive of enabling data tiling by default selectively and incrementally, but definitely not supportive of the all-or-nothing approach proposed in the PR. Some points to back up my stance:
  • Data tiling needs a bit more running before we can enable it by default. We have enabled it on two new models internally and hit compilation and runtime issues (see #15061 and #15076) which have significantly delayed the bring-up of the models. I think DT would benefit from a bit more extensive testing beyond CI and the two or three trending models we are looking at right now.
  • Data tiling performance depends on ukernels, which have limited coverage and can’t be enabled by default. We haven’t invested any cycles into code generating “transposed” or mmt4d matrices but perhaps that’s something we can consider to extend data tiling applicability.
  • Data tiling performance also depends on how well, if at all, the pack/unpack operations are fused with their respective producers/consumers and how those fused operations are optimized. Taking a look at the profile information of a model with ~70% execution time on code generated matmuls, data tiling led to 40% execution time on pack/unpack ops. This indicates that there’s more work to be done, especially around propagating the pack/unpack ops further away, as Intel has demonstrated.
In general, I see data tiling as one more codegen “pre-processing” strategy, similar to padding-only, peeling, or masking, and I think it should be treated as such. We currently have logic to selectively enable these strategies depending on the target, the op itself, shapes available at compile time, etc. We should add data tiling into the equation and enable it for well supported and performant cases, and be ready to move to another strategy when performance and stability dictate so. This should give us the optionality and flexibility needed for this type of enablement.

Regarding concentrating on performant and well invested paths, as far as I know, our team plans to continue investing in data tiling and non-data tiling approaches with performance improvements for both of them to land this quarter and the next one.

Thanks,
Diego

Benoit Jacob

Oct 17, 2023, 9:22:42 PM10/17/23
to Diego Caballero, Stella Laurenzo, Hanhan Wang, iree-discuss, Stella Laurenzo, Mahesh Ravishankar
On Tue, Oct 17, 2023 at 8:57 PM Diego Caballero <diegoca...@google.com> wrote:
Data tiling performance depends on ukernels, which have limited coverage and can’t be enabled by default.

Re "Data tiling performance depends on ukernels", I wonder what is meant by that? This sounds like Andrzej's comment, to which I replied earlier today.

DT is something that makes sense to do on its own, and it produces performance benefits today, already with UK. In fact, on f32 workloads, where codegen produces code already as fast as UK, there is no measurable performance difference between DT+UK and DT alone. With other element types such as (i8 x i8 -> i32), codegen isn't able to generate the right instructions to be as fast as UK, so DT+UK does perform better than DT alone, but that isn't a fact about DT at all. It's just about codegen not doing the right things, orthogonally to DT.

Re "[...] ukernels, which have limited coverage and can’t be enabled by default.", I also would like to understand what is meant here. But I don't want to distract from the present RFC, which is about DT.
Benoit

Stella Laurenzo

Oct 18, 2023, 12:26:31 AM10/18/23
to Benoit Jacob, Diego Caballero, Stella Laurenzo, Hanhan Wang, iree-discuss, Mahesh Ravishankar
I may not be able to make it to Thursday's Mai-Tai meeting, but it would be great if you all could discuss a bit there. We're leaving quite a bit of performance on the table with this disabled, and I'd love to hear a plan for getting over the hump. I have to admit that all of the data that I am looking at overwhelmingly favors enabling this and working forward from there. If the data is lying or isn't being taken in full context, then that would be a good thing to isolate and understand.


Mahesh Ravishankar

Oct 18, 2023, 12:45:14 AM10/18/23
to Benoit Jacob, Diego Caballero, Stella Laurenzo, Hanhan Wang, iree-discuss, Stella Laurenzo
Thanks Hanhan! 

Some more context on this. We have been developing this for almost a year now (with distractions 😃). These are some of the things we had on our docket before we were comfortable turning this on by default:
1) offloading to microkernels
2) fusion of pack/unpack with producers/consumers
3) constant folding of pack operations.

These were each individually a significant amount of work, but the understanding was that all of them needed to happen to get good performance. Now that all of these are in, we are on par with or better than the non-data-tiling path for x86 and Arm. No path is perfect; there will be cases where the data-tiling path is slower, but those are most likely bugs. For context, here are the same numbers I used for the MLIR workshop talk, but including the default IREE numbers (all numbers are x86, and lower is better):


[Three benchmark charts from the MLIR workshop talk comparing latencies of the data-tiling path against the IREE default on x86; lower is better.]

So the data-tiling path is 2-3x faster in a lot of cases. For GPT2 the gap is smaller since we have to support matvec and vecmat, which are being worked on. Apart from performance, there are other advantages of the data-tiling path:
1) We have been wanting to deprecate the pad + hoist path, which has been stuck in a local minimum for a while. Getting out of that has been very challenging. It was very clear that it isn't keeping up with industry standards, but improving it was extremely hard as well. With data tiling turned on by default, the data-tiling path will take over as the load-bearing path, so we can deprecate this fairly complex and fragile path.
2) We have done literally zero tuning of the tiling parameters, and have not included any multi-level tiling that we know will help the data-tiling path. The takeaway is that multi-level tiling will give a further bump in performance even with data tiling, but performance will not fall off a cliff if the lowering config is off by a bit (as happens on the non-data-tiling path). So the numbers above have more room to grow.
3) I think this is the easiest way to support dynamic shapes (though the vector masking path seems to be generating pretty decent code visually, I haven't measured the performance).

All this suggests we should just turn the data-tiling path on by default. We are leaving performance on the table (or behind the flag) by not making this the default. We have received multiple pieces of correspondence saying IREE CPU codegen is below par, since users don't get the best performance out of the box. And to restate: the non-data-tiling path is still lit up and tested on CI; it's just not the default. Any regressions/disruptions can be mitigated by turning the flag off locally. There is no intent to disrupt any work in flight within IREE (data-tiling or otherwise).

Responding to some of the comments above

> Data tiling needs a bit more running before we can enable it by default. We have enabled it on two new models internally and hit compilation and runtime issues (see #15061 and #15076) which have significantly delayed the bring-up of the models. I think DT would benefit from a bit more extensive testing beyond CI and the two or three trending models we are looking at right now.

It has already been tested on CI for months now and has had enough sit time. I really don't see a reason to delay the flip any longer. Any crashes you are seeing I would treat as prerequisites for the flag flip, and any performance degradation is worth filing an issue for, with a repro. You can always turn off the flag to unblock yourself if you are hitting issues.

> Data tiling performance depends on ukernels, which have limited coverage and can’t be enabled by default. We haven’t invested any cycles into code generating “transposed” or mmt4d matrices but perhaps that’s something we can consider to extend data tiling applicability.

Theoretically, no: data tiling by itself helps performance. In any case, we are only enabling data tiling for archs/matmul data types that are backed by ukernels; everything else will still go through the existing path. Also, the difference between the ukernel and non-ukernel paths for mmt4d comes down to how well vector lowering + LLVM do in terms of instruction selection and scheduling. If we can fix codegen to do a good job, great, but ukernels short-circuit this for the most common cases.
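For concreteness, here is a pure-Python sketch of the computation an mmt4d-style kernel performs on packed operands. Tile sizes and layouts are illustrative assumptions; the three innermost loops are what a ukernel replaces with target-specific SIMD, which is exactly the instruction-selection gap mentioned above.

```python
# Illustrative only: mmt4d-style matmul on packed operands.
# lhs is MB x KB blocks of shape m0 x k0; rhs is NB x KB blocks of
# shape n0 x k0 (RHS tiles stored "transposed", so both operands
# stream contiguously through the reduction).

def mmt4d(lhs, rhs, m0, n0, k0):
    MB, KB, NB = len(lhs), len(lhs[0]), len(rhs)
    acc = [[[[0] * n0 for _ in range(m0)] for _ in range(NB)]
           for _ in range(MB)]
    for bi in range(MB):
        for bj in range(NB):
            for bk in range(KB):
                tile_a, tile_b = lhs[bi][bk], rhs[bj][bk]
                # Inner tile update: this is the loop nest a ukernel
                # implements with the right SIMD instructions.
                for i in range(m0):
                    for j in range(n0):
                        for k in range(k0):
                            acc[bi][bj][i][j] += tile_a[i][k] * tile_b[j][k]
    return acc

# One 2x2 tile each: A = [[1,2],[3,4]], B = [[5,6],[7,8]] (B stored
# transposed), so the single output tile equals A @ B.
lhs = [[[[1, 2], [3, 4]]]]
rhs = [[[[5, 7], [6, 8]]]]
print(mmt4d(lhs, rhs, 2, 2, 2)[0][0])  # [[19, 22], [43, 50]]
```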

> Data tiling performance also depends on how well, if at all, the pack/unpack operations are fused with their respective producers/consumers and how those fused operations are optimized. Taking a look at the profile information of a model with ~70% execution time on code generated matmuls, data tiling led to 40% execution time on pack/unpack ops. This indicates that there’s more work to be done, especially around propagating the pack/unpack ops further away, as Intel has demonstrated.

Based on the results above and previous numbers I have seen, that high an overhead is an outlier. It's definitely worth investigating, but I don't see it as a blocker for flipping the flag. It's one model, and I haven't actually seen the model, so I don't understand why this is happening. These numbers also don't say anything about whether it is faster than the (current) default codegen flow or not.

Mahesh Ravishankar

Oct 18, 2023, 12:52:40 AM10/18/23
to Benoit Jacob, Diego Caballero, Stella Laurenzo, Hanhan Wang, iree-discuss, Stella Laurenzo
Oh, missed one thing:

> Regarding concentrating on performant and well invested paths, as far as I know, our team plans to continue investing in data tiling and non-data tiling approaches with performance improvements for both of them to land this quarter and the next one.

That's great. Just because the data-tiling path is the default, contributions to the non-data-tiling path shouldn't stop. Anything that makes it better is welcome, and I suspect those contributions will go beyond just support for GEMMs. So I am really glad this is being pushed on.

Andrzej Warzynski

Oct 18, 2023, 11:12:11 AM10/18/23
to iree-discuss
I have the impression that we are converging towards "one size fits all", and I am not sure that's the right approach. Looking at various other frontends (e.g. Clang), it is common to fine-tune compiler configuration (e.g. backend configuration) for a particular target. And since every CPU (and every micro-architecture) exhibits different characteristics, I'd assume that it would require dedicated fine-tuning (and, indeed, different lowering strategies). The numbers that Mahesh has shared are super encouraging (well done!), however:
  • are the "Arm" numbers for "Arm NEON" or for "Arm SVE",
  • are these numbers with or without u-kernels (mostly curious how much code-gen contributes to this)?
In the case of "Arm" I think it's important to make a distinction between NEON and SVE - these are very _very_ different SIMD extensions. We are focusing our efforts on SVE, but I don't quite know what the status of data tiling is in that area (we have only recently enabled scalable vectorisation). In particular, I would rather avoid telling our users to disable data tiling because it doesn't work or doesn't offer the performance gains they were promised. Having said that, I appreciate that there is no SVE in public CI, which makes us a "downstream" user at this point.

>  We should add data tiling into the equation and enable it for well supported and performant cases, and be ready to move to another strategy when performance and stability dictate so. This should give us the optionality and flexibility needed for this type of enablement.

+1 to having this sort of flexibility.

Mostly just my 2p. Data tiling is great compiler tech, congrats on all the amazing progress!

-Andrzej

Do Po

Oct 18, 2023, 11:55:43 AM10/18/23
to Hanhan Wang, iree-discuss
Hello,

As other replies point out, tiling with a default size may not be a solution. However, would it be possible to compile and tabulate best practices per example type and architecture?

Not just tiling, but also copying, microkernel use...

This would help external users a lot.

Dumitru


Diego Caballero

Oct 18, 2023, 12:36:02 PM10/18/23
to Do Po, Hanhan Wang, iree-discuss
Thanks for providing more context and sharing more about your use cases! I think we all agree on the value that data tiling brings to the table, so I'm optimistic that we can move this forward if we can provide the right level of optionality and composability with the rest of the optimization strategies.

Regarding the stability and performance issues, perhaps we could open an Epic and start tracking things there? I would add issues #15061, #15132, #15027 (very useful for profiling large models but low prio). I'll check if we have other issues tracked internally. I can also revisit the performance regression with current ToT and help with a reproducer (revisiting my chat log with Mahesh, there seemed to be dispatches performing tensor copies in isolation at flow level).

I also agree with Andrzej and Dumitru on fine-grained optionality requirements. We have to make sure data tiling integrates well with the rest of the compiler and composes with other strategies. Otherwise, with an all-or-nothing approach we would be putting the bar too high for enabling targets or specific ISAs that are left behind in the first enablement round. Again, not asking anything new here, that's what we have been doing for other optimization strategies. Implementation-wise, perhaps we could just always add the encoding and decide later in the compiler if the transformation is materialized or not based on target info and op information... Would that make sense?

Thanks,
Diego 

Mahesh Ravishankar

Oct 18, 2023, 12:39:00 PM10/18/23
to Andrzej Warzynski, iree-discuss
On Wed, Oct 18, 2023 at 8:12 AM Andrzej Warzynski <andrzej.wa...@gmail.com> wrote:
I have an impression that we are converging towards "one size fits all", and I am not sure whether that's the right approach. Looking at various other frontends (e.g. Clang), it is common to fine tune compiler configuration (e.g. backend configuration) for a particular target. And since every CPU (and every micro-architecture) will exhibit different characteristics, I'd assume that it would require dedicated fine-tuning (and, indeed, different lowering strategies). The numbers that Mahesh has shared are super encouraging (well done!)

I think this might be a disconnect between how we as compiler developers see the world and how other "users" see the world. We need to develop towards (a) good defaults that don't drop performance off a cliff, and (b) enough specialization hooks so that groups that want to invest more can use IREE as a substrate and tune accordingly (like SharkRT over IREE, for example). Data tiling as it is today provides reasonable coverage for a range of x86 and Arm hardware (again, Benoit can fill in more). We can also have a "generic" x86/Arm solution that gives you some benefit. I really don't see how we can do (a) without data tiling... that has been a hard problem for me to work through to a good end-state (though if people know how to do that, then by all means, that's welcome). I am fairly confident that data tiling can give you a reasonable default. It might not be peak, but it gets to within 80% of peak in most cases. Evidence for this is that we have actually done "0" tuning work (no adjusting of lowering-config tile sizes, etc.; there is a deterministic inner tile size per hardware). My read is that we don't need much tuning to get this into reasonable shape, which is great in terms of deployment.

 
, however:
  • are the "Arm" numbers for "Arm NEON" or for "Arm SVE",
I don't think these use SVE, but Benoit/Marie can say more.
 
  • are these numbers with or without u-kernels (mostly curious how much code-gen contributes to this)?
ukernels contribute a lot, especially for the non-fp32 cases (though I think these are all fp32).
 
In the case of "Arm" I think it's important to make a distinction between NEON and SVE - these are very _very_ different SIMD extensions. We are focusing our efforts on SVE, but I don't quite know what the status of data tiling is in that area (we have only recently enabled scalable vectorisation). In particular, I would rather avoid telling our users to disable data tiling because it doesn't work or doesn't offer the performance gains they were promised. Having said that, I appreciate that there is no SVE in public CI, which makes us a "downstream" user at this point.

We should definitely sync up on this. On the data tiling path it might come down to making the inner tile size chosen be dynamic, which is supported in theory, and is used in the VMVX backend. 
 

Mahesh Ravishankar

Oct 18, 2023, 12:41:11 PM10/18/23
to Diego Caballero, Do Po, Hanhan Wang, iree-discuss
On Wed, Oct 18, 2023 at 9:36 AM 'Diego Caballero' via iree-discuss <iree-d...@googlegroups.com> wrote:
Thanks for providing more context and sharing more about your use cases! I think we all agree on the value that data tiling brings into the table so I'm optimistic that we can move this forward if we can provide the right level of optionality and composability with the rest of optimization strategies.

Regarding the stability and performance issues, perhaps we could open an Epic and start tracking things there? I would add issues #15061, #15132, #15027 (very useful for profiling large models but low prio). I'll check if we have other issues tracked internally. I can also revisit the performance regression with current ToT and help with a reproducer (revisiting my chat log with Mahesh, there seemed to be dispatches performing tensor copies in isolation at flow level).

I also agree with Andrzej and Dumitru on fine-grained optionality requirements. We have to make sure data tiling integrates well with the rest of the compiler and composes with other strategies. Otherwise, with an all-or-nothing approach we would be putting the bar too high for enabling targets or specific ISAs that are left behind in the first enablement round. Again, not asking anything new here, that's what we have been doing for other optimization strategies. Implementation-wise, perhaps we could just always add the encoding and decide later in the compiler if the transformation is materialized or not based on target info and op information... Would that make sense?


We are only going to be adding the encoding to operations that work well with data tiling + architectures that are supported. So it's already being enabled only where it is expected to work well, AFAICS.
 

Diego Caballero

Oct 18, 2023, 12:51:04 PM10/18/23
to Mahesh Ravishankar, Do Po, Hanhan Wang, iree-discuss
> We are only going to be adding the encoding to operations that work well with data tiling + architectures that are supported. So it's already being enabled only where it is expected to work well, AFAICS.

Thanks! Do you have a pointer to this? I thought hardware information was not available at this level in the compiler.

Hanhan Wang

Oct 18, 2023, 12:52:42 PM10/18/23
to Diego Caballero, Mahesh Ravishankar, Do Po, iree-discuss
On Wed, Oct 18, 2023 at 9:51 AM Diego Caballero <diegoca...@google.com> wrote:
> We are only going to be adding the encoding to operations that work well with data tiling + architectures that are supported. So it's already being enabled only where it is expected to work well, AFAICS.

Thanks! Do you have a pointer to this? I thought hardware information was not available at this level in the compiler.

We've moved the SetEncoding pass to the GlobalOptimization phase. I am still working on it and can ping you once I push it to the draft PR.

Stella Laurenzo

Oct 18, 2023, 2:44:33 PM10/18/23
to Hanhan Wang, Diego Caballero, Mahesh Ravishankar, Do Po, iree-discuss
Thanks, folks.

I'm definitely supportive of continuing development on multiple strategies and different scenarios. What I'm trying to decide is whether the proposal is actually in conflict -- I think I hear the contributors all saying the same thing, but with slightly different levels of enthusiasm and sequencing of what comes next. Those are really good things to flush out, and it is great that all of the parties are leaning in there.

But I think this proposal is about establishing a much better default for cases that are known to benefit -- and it has the fringe benefit of untangling some things that will make the next steps easier.

When establishing such defaults we need to be following the data and selecting strategies that have good common case fan out. I think this proposal meets that bar.

So can you all help check my assumptions above? I may be missing some nuance? This just sounds like a case of people all being eager for the future where this proposal is about a step in the present.

Benoit Jacob

Oct 18, 2023, 4:17:39 PM10/18/23
to iree-discuss
Objections to data-tiling fall into 3 categories:
  1. Caring for e2e matmul performance, agreeing that data-tiling is ultimately necessary for it on about every CPU target, just discussing where to set the bar for benchmark parity before flipping the switch.
  2. Caring for e2e matmul performance and not agreeing that data-tiling is ultimately necessary for it on about every CPU target. Maybe believing that data-tiling is only needed on some architectures for some architecture-specific reason.
  3. Caring about traits other than e2e matmul performance.
My take on these:

1. Is formally valid: it is sane to require some metric of benchmark parity before flipping a switch. It comes down to a choice of metric. As has been pointed out, various averaged metrics point heavily to data-tiling, while some worst-case metrics point in the other direction. Like Mahesh, I believe that the averaged metrics should override the worst-case metrics and that flipping the switch would allow us to make forward progress from there.

2. Is wrong. It has been a point of consensus in matrix multiplication since the 2000s that you do need something like data tiling. The Goto paper popularized the notion under the name "packing". One can always dig up instances of a particular implementation foregoing packing for a particular shape on a particular target, but in the vast majority of cases you do need packing. The hardware reality making it so is widely shared: it's just a matter of having any SIMD ISA and any L1 cache.

3. Is probably fine, but then please spell out what it is that you care about, so we can understand how to weigh it. Just as an example, if you care about showcasing some SIMD ISA with some simple direct vectorization and lowerings, data tiling may feel like a step in the wrong direction; it may not produce the showcase you were going for. It's fine to care about that, but if the user is someone who looks at assembly, they can probably be bothered to pass another compiler flag. And if there's a requirement to enable even that without a compiler flag, that would be good to share in this conversation.
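Point 2's hardware argument (any SIMD ISA plus any L1 cache) can be made concrete with a toy offset calculation. The sketch below, plain Python with invented shapes, compares the element offsets the inner reduction loop touches in a row-major RHS versus a panel-packed one:

```python
# In a row-major K x N RHS, walking down a column during the reduction
# touches elements N apart, so for large N each access lands on a new
# cache line. After packing into n0-wide panels, the same traversal is
# nearly contiguous. Offsets are in elements; shapes are made up.

def column_offsets_row_major(K, N, col):
    return [k * N + col for k in range(K)]

def column_offsets_packed(K, N, n0, col):
    # Panel p holds columns [p*n0, (p+1)*n0), stored k-major, so element
    # (k, col) lives at panel_base + k*n0 + lane.
    panel, lane = divmod(col, n0)
    panel_base = panel * K * n0
    return [panel_base + k * n0 + lane for k in range(K)]

K, N, n0 = 4, 1024, 8
print(column_offsets_row_major(K, N, col=3))   # [3, 1027, 2051, 3075]
print(column_offsets_packed(K, N, n0, col=3))  # [3, 11, 19, 27]
```

The stride drops from N to n0, which is what lets the kernel keep its working set in L1 and feed SIMD loads from consecutive memory.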

Andrzej Warzynski

Oct 18, 2023, 4:20:03 PM10/18/23
to iree-discuss
> We should definitely sync up on this.

+1 This keeps being delayed at our end, so apologies for not engaging more.


> When establishing such defaults we need to be following the data and selecting strategies that have good common case fan out.

Yes, and the data points at X86 and Arm NEON :)


> This just sounds like a case of people all being eager for the future where this proposal is about a step in the present.

SVE and SME are very much the "present" for us :) (perhaps a bit "delayed present" compared to folks focusing on data tiling). For context, we use IREE to demonstrate the capabilities of scalable vectors/matrices (and we are impressed with IREE). While peak performance is the ultimate goal, we're still in the "architecture enablement" mode. And so we don't have the data to claim that scalable vectors are ready for data tiling (though, as people have hinted, this might be within reach).  To me it sounds like our customers might have different needs.

Going back to the proposal itself:
  • does it mean enabling data tiling for all ops or just matmuls?
  • why couldn't data tiling be a preprocessing strategy that's enabled by default on CPU targets for which it gives peak performance according to data?
-Andrzej

Stella Laurenzo

Oct 18, 2023, 4:38:58 PM10/18/23
to Andrzej Warzynski, iree-discuss


On Wed, Oct 18, 2023, 1:20 PM Andrzej Warzynski <andrzej.wa...@gmail.com> wrote:
> We should definitely sync up on this.

+1 This keeps being delayed at our end, so apologies for not engaging more.

> When establishing such defaults we need to be following the data and selecting strategies that have good common case fan out.

Yes, and the data points at X86 and Arm NEON :)

> This just sounds like a case of people all being eager for the future where this proposal is about a step in the present.

SVE and SME are very much the "present" for us :)

(Sorry, slip off a very tired brain :) )

I'm really just trying to wrap my brain around whether we have both/and decisions masquerading as either/or, forcing us into a false debate.

Diego Caballero

Oct 18, 2023, 6:44:16 PM10/18/23
to Stella Laurenzo, Andrzej Warzynski, iree-discuss
I don’t think the proposal is in conflict, or that there are any objections to data tiling. The only disagreement is the level of optionality that should be provided for the default enablement.
To summarize:
  • We all agree that DT is useful and provides value.
  • We all agree that it should be enabled by default on x86 and ARM Neon matmuls.
  • We all agree that stability and performance issues should be addressed before the default enablement.
The disagreement is about whether we should have the capability of enabling/disabling DT for a specific op (e.g., matmul), a specific target (e.g., aarch64) and a specific target feature (e.g., SME). 
Honestly, I’m not sure why this is such a big deal when we have the same level of optionality for other optimization strategies (padding, peeling, masking…). Having this optionality will just help compose DT with the rest of the approaches.

Motivating example: we continue adding support to apply DT to more ops (e.g., matmul, matvec, vecmat, conv, depthwise conv, pooling, etc.) and we want to enable DT on a new target (e.g., RISC-V). 
If I want to turn DT on for that target without having op-level optionality, I would have to make sure that the DT lowering (either ukernels/codegen) performs well (*) for ALL the ops we support DT on at once. 
With op-level optionality we could decide to enable DT for the new target only for one op and incrementally add more and more support. 
The same applies to whether we want to disable DT. We may want to have the chance to disable it for a specific op and not for all the ops.

Again, I’m not sure why we wouldn’t want this level of optionality for DT when we do have it for other existing optimization strategies, and the changes needed to get it seem minimal.
This doesn’t change the outcome of the proposed enablement in any way!

(*) There has been a significant amount of tuning work to enable DT on x86. E.g., we are generating assembly instructions from MLIR to efficiently implement transpose operations, which are indispensable for getting good performance on pack/unpack ops. I don’t think enabling DT for a new target will be just a matter of flipping a flag and getting good performance…

Mahesh Ravishankar

Oct 18, 2023, 7:30:46 PM10/18/23
to Diego Caballero, Stella Laurenzo, Andrzej Warzynski, iree-discuss
To clarify, data tiling is enabled only for matmul and batch-matmul operations, and soon matvec and vecmat. It will only be used for x86 and Arm because that's what we have looked at. I am sorry if we gave the impression that it would be turned on for all operations on all CPU backends; that is definitely not the case right now. The op-level control you are looking for is done in the `SetEncoding` pass that runs on the program, in "pre-processing" (actually the new global-optimization pass pipeline, which runs before the Flow passes). Maybe this would have been a lot easier to process if that had been made clear in the RFC at the outset; sorry about the misunderstanding.

Hanhan Wang

Oct 19, 2023, 5:09:51 PM10/19/23
to Mahesh Ravishankar, Diego Caballero, Stella Laurenzo, Andrzej Warzynski, iree-discuss
Diego, Andrzej, Benoit, Ben, and I had an offline discussion this morning and came up with other ideas on how to proceed. The pain point is that we have seen the value of data tiling, but not all targets are ready for it. We want to keep enough optionality for the other targets, so people can keep contributing without passing special flags.

The prototype was to (1) query target information at the SetEncoding stage, and (2) determine whether we want to insert set/unset_encoding ops or not. Ben pointed out that we should have better isolation between setting encodings and querying full target information, which is a very valid point to me. The logic of SetEncoding should not be entangled with the actual target. And the big missing piece is that we can only materialize encodings for a limited set of targets. IMHO, we should complete the functionality of data tiling for the other CPU targets and IREE backends. We can have a default materialization method, which basically undoes set_/unset_encodings for targets that are not ready for data tiling. This will be the default materialization pattern for the other CPU targets and IREE backends. Thankfully, we are able to materialize encodings early if there is a single target (i.e., not heterogeneous computing). Targets that haven't implemented data tiling can go with the original graph without any special flags. The graph does not change, so it does not turn off fusion opportunities.

The new proposal is

- Implement default materialization pass and use it for other CPU targets and IREE backends.
- Set encodings on matmul and batch_matmul. This will be controlled by a flag (which is on by default).

For x86 CPUs and Arm NEON CPUs, we will be able to set encodings and materialize them early in the GlobalOptimization phase. If the control flag is off, we fall back to today's solution. For other CPU targets (like RISC-V, Arm SVE, etc.) and other IREE backends, the set/unset_encodings disappear in the early materialization phase, because they use the default materialization method. This gives us better isolation, because encoding materialization logic only exists in each backend; we do not have to expose target information to SetEncoding itself.
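To make the shape of this concrete, here is a tiny Python sketch of the proposed flow. All names (`set_encoding`, `default_materialize`, `x86_materialize`, the tile size) are illustrative stand-ins, not IREE's real API: the encoding is attached without target knowledge, a target with data-tiling support picks a concrete packed layout, and the default materialization simply cancels the set/unset pair so the original graph is recovered.

```python
class Encoded:
    """A tensor tagged with a symbolic encoding; no concrete layout yet."""
    def __init__(self, data, role):
        self.data = data  # plain nested lists standing in for a tensor
        self.role = role  # e.g. "LHS" / "RHS" / "RESULT" of a matmul

def set_encoding(tensor, role):
    # Target-independent: only records *how* the tensor is used.
    return Encoded(tensor, role)

def unset_encoding(encoded):
    return encoded.data

def default_materialize(encoded):
    # Targets without data-tiling support: undo the encoding entirely,
    # i.e. the set/unset pair cancels and the graph is unchanged.
    return encoded.data

def x86_materialize(encoded, tile=8):
    # A target that implements data-tiling picks a concrete layout,
    # e.g. pad each row out to a multiple of the (made-up) tile size.
    return [row + [0] * (-len(row) % tile) for row in encoded.data]

lhs = [[1.0, 2.0, 3.0]]
assert default_materialize(set_encoding(lhs, "LHS")) == lhs  # no-op path
assert len(x86_materialize(set_encoding(lhs, "LHS"))[0]) == 8  # packed path
```

The point of the indirection is that only the last step differs per target; everything before it is shared and layout-agnostic.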

Mahesh Ravishankar

unread,
Oct 19, 2023, 8:22:14 PM10/19/23
to Hanhan Wang, Diego Caballero, Stella Laurenzo, Andrzej Warzynski, iree-discuss
On Thu, Oct 19, 2023 at 2:09 PM Hanhan Wang <han...@nod-labs.com> wrote:
Diego, Andrzej, Benoit, Ben and I had an offline discussion this morning, and came up with other ideas on how to proceed. The pain point is that we have seen the value of data-tiling, but not all targets are ready for it. We want to keep enough optionality for other targets, so people can keep contributing without passing special flags.

The prototype would (1) query target information at the SetEncoding stage, and (2) determine whether we want to insert set/unset_encoding ops or not. Ben pointed out that we should have better isolation between setting encodings and querying whole-target information, which is a very valid point to me. The logic of SetEncoding should not be entangled with the actual target. And the big missing piece is that we can only materialize encodings for a limited set of targets. IMHO, we should complete the data-tiling functionality for the other CPU targets and IREE backends. We can have a default materialization method, which basically undoes set_/unset_encodings when a target is not ready for data-tiling. This will be the default materialization pattern for other CPU targets and IREE backends. Fortunately, we are able to materialize encodings early when there is a single target (i.e., no heterogeneous computing). Targets that haven't implemented data-tiling can proceed with the original graph without any special flags. The graph does not change, so no fusion opportunities are lost.

The new proposal is

- Implement default materialization pass and use it for other CPU targets and IREE backends.
- Set encodings on matmul and batch_matmul. This will be controlled by a flag (which is on by default).

For x86 CPUs and Arm NEON CPUs, we will be able to set encodings and materialize them early in the GlobalOptimization phase. If the control flag is off, we fall back to today's solution. For other CPU targets (like RISC-V, Arm SVE, etc.) and other IREE backends, the set/unset_encodings disappear in the early materialization phase, because they use the default materialization method. This gives us better isolation, because encoding materialization logic only exists in each backend; we do not have to expose target information to SetEncoding itself.

Cool, this makes sense! Thanks! I thought this was already what we were doing, but I might have missed some details. Thanks for pushing on this!

Stella Laurenzo

unread,
Oct 19, 2023, 8:23:57 PM10/19/23
to Mahesh Ravishankar, Hanhan Wang, Diego Caballero, Andrzej Warzynski, iree-discuss
I thought this was where this was going, but thank you for having the discussion and getting everyone on the same page. Communication can be hard, and thank you all for prioritizing it.

Stella Laurenzo

unread,
Oct 19, 2023, 8:46:35 PM10/19/23
to Stella Laurenzo, Mahesh Ravishankar, Hanhan Wang, Diego Caballero, Andrzej Warzynski, iree-discuss
(also double checking: this sounds on the same page but is it? It occurs to me that it may be valuable to hear more about some of the other work/directions being pursued, since more visibility will help everyone understand)

Diego Caballero

unread,
Oct 20, 2023, 1:31:30 AM10/20/23
to Stella Laurenzo, Stella Laurenzo, Mahesh Ravishankar, Hanhan Wang, Andrzej Warzynski, iree-discuss
Thanks all for the clarifications! Yeah, I think the communication went off the rails, and my overreaction in the first email didn't help much either. Sorry about that. It's a shame we couldn't catch up about this during LLVM Dev. It would have helped.

I think I'm connecting some dots now:

Hanhan> Ben pointed out that we should have better isolation between setting encodings and querying whole target information, 
Diego> Thanks! Do you have a pointer to this? I thought hardware information was not available at this level in the compiler.

and 

Hanhan> We can have a default materialization method, which basically can undo set_/unset_encodings if they are not ready for data-tiling. 
Diego> Implementation-wise, perhaps we could just always add the encoding and decide later in the compiler if the transformation is materialized or not based on target info and op information... Would that make sense?

seem to be talking about the same thing, right? Honestly, I don't have an opinion about how this should be implemented. The key point for me is that we have optionality both at the target and op level. To be more specific, we should be able to generalize this and this (sorry, this is currently naive and not well architected) so that we have a single place to encode these optimization decisions, including whether data tiling should be used and for which specific ops (see simple examples in the second link). Moving forward, we might even want to turn this into an external extensibility point so that users can override these in-tree decisions and even plug in more involved tuning (e.g., fine-grained results from TD search?). (Just trying to provide more perspective and rationale for the request, not something to discuss right now.)

Would the new approach allow/be aligned with this?

Thanks,
Diego

Andrzej Warzyński

unread,
Oct 20, 2023, 11:12:33 AM10/20/23
to iree-discuss
I also appreciate the extra context and apologise if I sounded negative about data tiling. Nothing could be further from the truth :)

The plan that Hanhan outlined makes sense to me, though to be honest I have yet to fully grasp how data tiling is implemented. From my perspective, it is important that we have some control over the lowering so that our users, by default, get something that's:
  • functionally correct (there's still a lot that doesn't work with scalable vectors and that needs fixing/implementing),
  • optimal given the current state of SVE/SME in IREE/MLIR (even if data tiling works today, it might be sub-optimal and require further work).
I believe that what Hanhan is proposing should work, thanks!

Btw, thank you for listening to my concerns and patiently addressing all my questions in the call yesterday.

Best,
Andrzej  

Ben Vanik

unread,
Oct 20, 2023, 11:29:54 AM10/20/23
to Andrzej Warzyński, iree-discuss
RE: Diego> Thanks! Do you have a pointer to this? I thought hardware information was not available at this level in the compiler.

It *shouldn't be* available, and if it is then it's tech debt. We'll push back against hardware information being used at that level in the core compiler - anything using hardware information at the frontend level is directly equivalent to `with torch.cuda.device(1):` kind of stuff in python and it's really bad for the health of the compiler infra. Users will need to pass flags or inject configuration into the inputs to the compiler to make use of such information as then it's a direct user choice to break core compiler functionality like multi-targeting and heterogeneous execution, just as when a user puts `with torch.cuda.device` in their python they are saying they don't care about those things. Plugins and things out of tree can of course do what they want and balance those tradeoffs but we can't make those tradeoffs for users: "out of the box with IREE you get fast matmuls on this particular single-threaded CPU arch _or_ you get multitargeting" is not a feature matrix we want to explain :)

Stella Laurenzo

unread,
Oct 20, 2023, 12:29:25 PM10/20/23
to Ben Vanik, Andrzej Warzyński, iree-discuss


On Fri, Oct 20, 2023, 8:29 AM Ben Vanik <b...@nod-labs.com> wrote:
RE: Diego> Thanks! Do you have a pointer to this? I thought hardware information was not available at this level in the compiler.

It *shouldn't be* available, and if it is then it's tech debt. We'll push back against hardware information being used at that level in the core compiler - anything using hardware information at the frontend level is directly equivalent to `with torch.cuda.device(1):` kind of stuff in python and it's really bad for the health of the compiler infra. Users will need to pass flags or inject configuration into the inputs to the compiler to make use of such information as then it's a direct user choice to break core compiler functionality like multi-targeting and heterogeneous execution, just as when a user puts `with torch.cuda.device` in their python they are saying they don't care about those things.

I'll also note that a lot of the work going into the PyTorch frontend is specifically aimed at raising a lot of these "better as user choices or auto-tuning" things to that level. Active work is going on there.

Diego Caballero

unread,
Oct 23, 2023, 3:05:22 PM10/23/23
to Stella Laurenzo, Ben Vanik, Andrzej Warzyński, iree-discuss
I thought we were converging, but I'm not sure anymore after Ben's email.

Mahesh> The op-level control you are looking for is done in the `SetEncoding` pass that is run on the program, and is done in the "pre-processing" (actually the new global optimizations pass pipeline which runs before Flow passes)

Is this still the case?

Ben> anything using hardware information at the frontend level is directly equivalent to `with torch.cuda.device(1) 

I've been very intentionally not commenting on implementation details because I wanted to avoid going into this sensitive topic. Again, I don't care about the implementation as long as the optionality is there and strategies compose.
Having said that, I have the impression that there might be a disconnect between what different people understand by target. `with torch.cuda.device(1)` is a scheduling statement that binds the runtime execution to device #1 in the system. The target information I'm talking about is universal information that is statically available within the compiler to optimize and generate code. It's completely independent of the system we are compiling on, and it's not making any runtime/scheduling decisions (other than that the generated ISA must be supported by whatever device we run it on). Again, I don't want to go into this discussion, but I thought it would be important to clarify this.

We'll push back against hardware information being used at that level in the core compiler

On a personal note, this statement makes me feel a bit uncomfortable. Who is included and excluded in this "we"? Perhaps it would be a good time to revisit the governance of the project and how decisions are made. I really would like to see us moving towards a model where more project stakeholders and expert users at different levels are represented in the decision-making process.

Thanks,
Diego


Quentin Colombet

unread,
Oct 24, 2023, 8:55:58 AM10/24/23
to iree-discuss
Hi all,

I have a few, maybe naive, questions.

- How are the default tile sizes chosen?
- How does this take the target architecture into account?

I have a hard time reconciling what is being discussed in this thread: on one hand we say that data-tiling is generally good (which I would agree with); on the other hand, it sounds like the tiling is completely hardware independent, and this makes no sense to me. (Yes, the technique to do data-tiling should be HW independent, but the actual tile sizes shouldn't be, IMHO.)
Maybe I'm missing something but I would expect that to choose the right tile sizes, we need to have an idea of how big the caches (Lx, shared memory, registers, etc.) are.

Like Diego said, this information is universally and statically available, so:
1. I don't see why there would be a push back on using this information. Maybe I misunderstand the conversation here (in particular data-tiling is not a front-end thing so I have a hard time putting Ben's comment "anything using hardware information at the frontend level is directly equivalent to..." in context.)
2. how do we get universally good performance if we don't account for HW specificities?

Cheers,
-Quentin

Stella Laurenzo

unread,
Oct 24, 2023, 11:52:43 AM10/24/23
to Quentin Colombet, iree-discuss


On Tue, Oct 24, 2023, 5:56 AM 'Quentin Colombet' via iree-discuss <iree-d...@googlegroups.com> wrote:
Hi all,

I have a few, maybe naive, questions.

- How are the default tile sizes chosen?
- How does this take the target architecture into account?

I have a hard time reconciling what is being discussed in this thread: on one hand we say that data-tiling is generally good (which I would agree with); on the other hand, it sounds like the tiling is completely hardware independent, and this makes no sense to me. (Yes, the technique to do data-tiling should be HW independent, but the actual tile sizes shouldn't be, IMHO.)
Maybe I'm missing something but I would expect that to choose the right tile sizes, we need to have an idea of how big the caches (Lx, shared memory, registers, etc.) are.

The conversation got a bit off onto hypotheticals. The equivalent of the target triple is technically available at the frontend, and the pushback was on using it there. At the frontend, the preference is to structure the program so that enough structure survives for the backend(s) to make detailed decisions. These frontend structures should be as target-independent as possible rather than reaching into the target attributes.

(While there are no examples in tree, there are cases where it is expected that we want to query a target dependent characteristic in the frontend in order to make a whole graph transformation for an optimization that is hard to achieve at a lower level, but tile size selection is not such a case, afaict. When that bridge is crossed, it shouldn't be a naive triple match and needs some additional work/thought to make sure it is done in a multi-target compatible way.)

So what data tiling is doing is applying a form of "symbolic encoding" at the graph level (basically noting that the backend is free to choose a concrete encoding) and then propagating that, handling cancellation, etc. This results in a graph where the tensors have all been marked as encodable. Then the backends choose concrete encodings and corresponding tile sizes. But at the graph/frontend level, it just has the structural information that a backend can make a concrete choice there (including not to encode at all).

That's my understanding anyway at a high level.

I think we're getting hung up on phasing and terminology in the default flow...
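As a toy illustration of that split (all names and tile sizes here are invented, and real backends choose from much richer target info than a string): the graph-level transformation only records that a re-layout is allowed, the backend picks a tile size from its own target knowledge at codegen time, and a packed matmul must agree with the plain one, with an unknown target degenerating to a no-op.

```python
def matmul(a, b):
    """Plain row-major reference matmul on nested lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def pick_tile(target):
    # Stand-in for per-backend heuristics (vector width, ISA features,
    # cache sizes, ...); unknown targets get tile 1, i.e. no re-layout.
    return {"x86-avx512": 16, "arm-neon": 8}.get(target, 1)

def pack_rows(a, t):
    # Pad the row count up to a multiple of t, then group rows into tiles.
    rows = a + [[0] * len(a[0])] * (-len(a) % t)
    return [rows[i:i + t] for i in range(0, len(rows), t)]

def packed_matmul(a, b, target):
    t = pick_tile(target)
    out_tiles = [matmul(tile, b) for tile in pack_rows(a, t)]
    flat = [row for tile in out_tiles for row in tile]
    return flat[:len(a)]  # drop the padding rows again
```

Whatever tile the backend picks, `packed_matmul(a, b, target)` must equal `matmul(a, b)`: the tile size only changes the layout the kernel iterates over, never the values.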




Ben Vanik

unread,
Oct 24, 2023, 11:55:20 AM10/24/23
to Quentin Colombet, iree-discuss
It's all about compartmentalization and separation of concerns. The only thing that cares about the exact data layout of a particular logical tensor is the code that is loading/storing tiles of that tensor, and everything in the system besides codegen just considers that tensor an opaque bag of bits: so the only place that needs to know the exact data layout of a logical tensor is codegen during code generation.

Prior to codegen - which happens at the tail end of compilation with full target information (caveat: codegen should not be assuming static target information! we don't want 10000 copies of every dispatch because an scf.if wasn't emitted! things like the spirv target environment run afoul of this today) - the only thing that matters is that the logical tensors flowing between dispatches are consistently laid out: dispatch A which produces a tensor with some particular layout must have a way to communicate the layout it chose to dispatch B that consumes it.

Each backend has a function that takes an entire dispatch region and all its ops, its target information, the tensor (shapes/data type), and some bits defining how the tensor was produced/is consumed, and returns the exact data layout: so the only information we need to carry prior to codegen is those extra bits defining how the tensor was produced/is consumed, and we do that in IR with SetEncoding. Once compilation gets to codegen of dispatch A and dispatch B it runs that function and knows exactly what the data layout is, while prior to that point it was just "optimal for this usage of this particular tensor on whatever device it ends up running on," and it will be consistent.

There are cases that get tricky, like multi-targeting with heterogeneous execution, but just as with the above (see how images have worked on GPUs for decades with optimal layouts) there are solutions to that, and all of them are only possible when starting with something like this.

It's precisely like Java + JNI: you can have a .jar that is platform agnostic and a set of JNI libraries compiled for each architecture. javac doesn't need to know which native architectures you are planning to target with your native code (they are independent compilers and can be compiled in any order) but you can still have "produce_some_data" and "consume_some_data" calls implemented in native code that pass around byte buffers in a totally opaque form: java doesn't care and will happily run `my_library.consume_some_data(my_library.produce_some_data())` even if each target architecture stores the data differently. Now if what you passed through the java level was `MyDataInAFormatForAArch64WithDotProdWhatever` then yeah, you're going to have a hard time, hence why we need the separation of data semantics from the exact layout with tensor encodings (and in java you'd use a `ByteBuffer` to say "don't care, whatever you want").

Now some users are fortunate enough to be able to say "but I know the entire model is running on this particular CPU arch with these particular features and this core count, so I can hardcode the data layout of its I/O to this particular one," and that is precisely what I'm referring to with `with torch.cuda.device` in frontends or global compiler flags. There's no ambiguity here: either you don't know precisely (but may know generally) which device a line of code is going to run on until later in compilation, where we perform placement and need to keep it "symbolic", _or_ you told IREE precisely where it's going to run and can hardcode it. As we aren't building a compiler that requires users to specify where dispatches run, we need to be able to compile programs effectively without such information. If the information is available because the user specified it (or something else did - cost models, whatever) then we can use it optimistically (annotations on certain layers/ops, etc). What we want to avoid is 20-30-40-50-...% of the efficiency that's achievable with the compiler only being possible when the user specifies placement in the inputs - a pass that makes things 10% faster across the board for all users with all inputs is worth significantly more to the core infra and its users than a pass that makes things 30% faster on a single target+model+data type+compiler configuration, as the reach is much smaller (and may be 1). Such is the role of building a piece of generic infrastructure (compilers, media transcoders, OS kernels, drivers, etc). We have plugins as a way for users out of tree to make those highly-specialized decisions if they don't or can't generalize them.
It doesn't mean we can't have some specializations in core - especially scoped ones deep in codegen - but introducing whole-program transformations that only work for specific instantiations of the compiler and specific inputs is an anti-pattern in general infra, regardless of how valuable they may be to a particular user.

It's all similar to how porting a mess of single-threaded code to (effective) multi-threaded code is nearly impossible, while making multi-threaded code run single-threaded is easy: with compiler design, making a single-target compiler multi-target is nearly impossible. IREE's goal as a project since inception (so no surprise here, I hope) is to multi-target and run heterogeneously, and thus that is what we build toward. Single-target/single-device is a degenerate case of this, and that's why we _can_ support knowing the information very early on for specific users, but that is not the goal of the project, and we don't want to write/maintain/depend on code that treats it as the only case when it's possible to do it in a way that is more in line with project goals. As such we'll push back on solutions that are "works for my use case" landing in the core code: it's just like how LLVM would not allow additions to compiler-rt that were thread-hostile because the group contributing them was running on bare-metal single-core embedded devices; instead they would say those belong in user code unless made safe. The goal with the core IREE infra is to enable special-casing but mostly only carry code that is generic in-tree. Contributions to the core compiler are weighed based on how applicable they are to users, how much they disrupt/disturb/delay the project goals, whether they advance the understanding of a particular area in order to help future iterations generalize better, or whether they are isolated to a point where they can be easily added/removed/changed. Normal stuff in infra projects. We're going to need to start cleaning up the tech debt that has been introduced and really get serious about multi-targeting and multi-device, as it's shaping up to be the focus of next year for the core team.
Data tiling is one of the foundational pieces of this and Mahesh and Hanhan have done heroic work to retrofit it into the compiler in a way that enables these next steps while demonstrating how target-specific specializations can still exist in generalized infrastructure and it's exciting that more of the compiler will be pulled along in this direction.

HTH; TLDR: you can specialize all you want in your own plugins/code, you can optimistically specialize general cases in the core infra if the code still generalizes without the information, and specializations that are highly specific to individual cases are strongly disfavored at higher layers of the stack in the core infra for all the above-mentioned reasons. Data tiling is a good example of how to add indirection to decouple algorithmic needs from target specialization.
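The per-backend layout function described above can be sketched as a pure function (target names, usages, and tile sizes below are all made up): because the layout depends only on the target, the tensor, and how it's used, the producing and consuming dispatches resolve the same layout independently, with nothing communicated through the graph.

```python
def resolve_layout(target, shape, dtype, usage):
    # Pure function of its inputs: same (target, tensor, usage) in,
    # same layout out -- producer and consumer never need to coordinate.
    if usage == "matmul-lhs" and target == "x86-avx512":
        return ("tiled", 16, 1)
    if usage == "matmul-lhs" and target == "arm-neon":
        return ("tiled", 8, 1)
    return ("row-major",)  # default: leave the tensor alone

# Dispatch A (producer) and dispatch B (consumer) each ask on their own:
a_view = resolve_layout("x86-avx512", (128, 256), "f32", "matmul-lhs")
b_view = resolve_layout("x86-avx512", (128, 256), "f32", "matmul-lhs")
assert a_view == b_view  # consistent layout without any handshake
```

The purity is the whole trick: the tensor stays an opaque bag of bits everywhere between the two dispatches.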


Ben Vanik

unread,
Oct 24, 2023, 12:03:11 PM10/24/23
to Quentin Colombet, iree-discuss
Collision! Stella's summary is spot on and a good TLDR of data tiling:

> So what data tiling is doing is applying a form of "symbolic encoding" at the graph level (basically noting that the backend is free to choose a concrete encoding) and then propagating that, handling cancellation, etc. This results in a graph where the tensors have all been marked as encodable. Then the backends choose concrete encodings and corresponding tile sizes. But at the graph/frontend level, it just has the structural information that a backend can make a concrete choice there (including not to encode at all).

Quentin Colombet

unread,
Oct 24, 2023, 1:09:02 PM10/24/23
to iree-discuss
Thanks for the clarifications.

I thought that enabling data-tiling meant: we hardcode the tile sizes very early in the compiler pipeline :).

Stella Laurenzo

unread,
Oct 24, 2023, 1:15:41 PM10/24/23
to Quentin Colombet, iree-discuss
On Tue, Oct 24, 2023 at 10:09 AM 'Quentin Colombet' via iree-discuss <iree-d...@googlegroups.com> wrote:
Thanks for the clarifications.

I thought that enabling data-tiling meant: we hardcode the tile sizes very early in the compiler pipeline :).

(sorry - I saw this bit flip last week but between being off and a lot of other things, didn't have a chance to write a response)
 

Diego Caballero

unread,
Oct 24, 2023, 5:21:17 PM10/24/23
to Stella Laurenzo, Quentin Colombet, iree-discuss
Thanks for elaborating.

Stella> ... a backend can make a concrete choice there (including not to encode at all).
Diego> Implementation-wise, perhaps we could just always add the encoding and decide later in the compiler if the transformation is materialized or not based on target info and op information.

I think if these two comments are stating the same thing, that should be enough to implement the optionality needed. Happy to help with the optionality-integration side of things once the PR is ready.




Hanhan Wang

unread,
Nov 9, 2023, 6:49:42 PM11/9/23
to Diego Caballero, Stella Laurenzo, Quentin Colombet, iree-discuss
Thank you for the valuable input! Over the past few weeks, I've addressed most of the concerns and feedback.

In terms of functionality, there are no more crashes, and I want to thank Natasha for extending data-tiling to vecmat and matvec. It gives us more coverage in the data-tiling path.

Regarding performance, we've enhanced codegen for pack/unpack operations and the distribution logic. As a result, we've observed significant improvements (with ukernels) when testing new OSS models.

Here are the key highlights of our progress.

- GPT2_117M_TF_1x4xI32 (pixel-6, 1-thread): 309 ms -> 31 ms
- GPT2_117M_TF_1x1xI32 (pixel-6, 4-thread): 76 ms -> 25.9 ms
- Falcon7bGptqPT (x86, 8-thread) 37 sec -> 5.6 sec
- BertLarge_TF (x86, 8-thread): 1548 ms -> 395 ms
- Vit_int8 (pixel-6, 4-thread): 1289 ms -> 649 ms
- MobileNetV2_int8 (x86, 1-thread) 24 ms -> 13 ms
- etc

We've encountered some regressions in the models we've been working on previously. I find these regressions acceptable, since they occur only in a few specific cases and are not significant; they only happen in multi-threaded runs. I believe that with some future work we can address these issues effectively. Here are the models that have been affected.

- MobileNetV2_fp32 (x86, 8-thread): 4.78 ms -> 6.94 ms
- MobileBertSquad_fp32 (x86, 8-thread): 50.9 ms -> 71.8 ms
- DeepLabV3_fp32 (x86, 8-thread) 8.6 ms -> 10.8 ms

For the backends and CPUs that have not yet implemented data-tiling, it essentially functions as a no-op: we can revert the encodings on targets that are not yet ready for data-tiling.

PR #15256 is up for review. It flips IREE to turn data-tiling on by default. Please let me know if there are any other comments/concerns.

Thanks,
Hanhan
