iree and xnnpack


Do Po

Oct 6, 2023, 8:29:21 AM
to iree-discuss

Hello,

I have a (probably very stupid) question.

Delegates such as XNNPACK allow attaining high performance on certain devices by providing highly optimized implementations of certain operations.

Would it be possible to use these implementations directly from IREE for certain operators?

Regards,
Dumitru

Jacques Pienaar

Oct 6, 2023, 8:36:35 AM
to Do Po, iree-discuss
Hey,

Look at https://github.com/openxla/openxla-nvgpu where something like this was done for cuDNN and Triton. Note that this is not magic: one needs to consider how these should compose, and the barriers imposed, both optimization- and execution-wise, when doing so (XNNPACK is built assuming it is the only thing controlling all the resources, so one would probably be better served to interoperate with the compute kernels one level below the API it exposes, to avoid this).

There is early work toward showing such a prototype, but even calling it early may be overselling it. "Prototype on the list for Q4" would be more accurate.

-- Jacques 


Stella Laurenzo

Oct 6, 2023, 8:50:53 AM
to Jacques Pienaar, Do Po, iree-discuss
For modern workloads, it is often more profitable and efficient for us to take just the block-level inner instruction sequences and teach the compiler to target those, either through the general code-generation infrastructure or through our C-level block "kernels" (not real kernels: they produce bitcode that is integrated into the low-level code-generation pipeline and optimized at model compile time).

Without strong evidence that a lower-level integration isn't a better fit, the core IREE team isn't integrating such high-level compute libraries as-is, as black boxes (versus as extensions to the compiler pipeline). But the facilities are available, as Jacques says, to do so, and we have used them with various customers for prototyping and early bring-up of new architectures.

Do Po

Oct 6, 2023, 9:37:34 AM
to Stella Laurenzo, Jacques Pienaar, iree-discuss
Thank you both for providing interesting input.
I guess I owe an explanation for my question, and this will allow me to ask a question in return.

We are trying to sell MLIR and IREE in an embedded context.
The MLIR/IREE combination has some very good qualities not found elsewhere: transparency (of the execution model and of all the code that is executed), control (of the compilation process and the generated code), traceability and even fine-grained performance tracing, low footprint, and the ability to compile more models (fewer errors, more systematic behavior). But all of this comes at the cost of performance, by significant factors. For instance, a 1.5x-2x performance loss w.r.t. TFLite+delegates on a well-integrated platform such as the Google Pixel phones. On a not-so-well-supported platform, such as some Qualcomm SoCs, the gap w.r.t. the native toolchain (SNPE, and now QNN, for Qualcomm) is far bigger.

It may be that the traceability and performance-tracing arguments are interesting enough for the IREE solution we propose to be evaluated.
But that is not yet clear.

My question would be: how is IREE justified in production today?

Best regards,
Dumitru

Stella Laurenzo

Oct 8, 2023, 2:43:02 PM
to Do Po, Jacques Pienaar, iree-discuss


On Fri, Oct 6, 2023, 6:37 AM Do Po <dumitr...@gmail.com> wrote:
My question would be: how is IREE justified in production today?

On larger systems, of which I have more direct current experience, we have never been unable to reach the top-tier performance spots. But if a platform is important enough to be a production target, we rarely just take the out-of-the-box default performance: we do what is necessary to target it for the highest performance. Within a "family" this effort tends to be amortized: advancements for one often translate at least partially to others.

The bigger systems are easier to say general things about because they are all known and, in general, fairly open platforms with a lot of people on them. Deeply embedded is hard because these projects are often done in private or for specific commercial concerns that don't always make their way back into the general knowledge pool in a timely manner (or ever). I am aware at least incidentally of a couple, and I believe their targets were met. I know that block-level, compile-time microkernels have been used persistently in this segment (and the feature was developed due to demand from multiple such stakeholders).

Deeply embedded is also hard because, if you don't own the platform, a third party often doesn't have access to the IP needed to achieve the best performance. We've done this enough times now that we are confident that, given such access and a product goal, we can get there; but that is pretty different from claiming an out-of-the-box experience on an arbitrary platform as an outsider.

Hope that helps... sorry easier answers are not available :/