Questions about performance optimizations in MLIR


Caballero, Diego

unread,
Jul 12, 2019, 5:49:39 PM7/12/19
to MLIR

Hi all,

 

We are about to start exploring some performance optimizations for nGraph in MLIR and we would like to know if there are already any ideas or plans in this regard.

Some comments/questions that come to mind are:

 

1.       Are there more optimizations under development? (for example, more loop optimizations based on the affine dialect?)

2.       We need to obtain the target hardware information to properly parametrize optimizations (cache sizes, vector length, number of registers, etc.). In LLVM we have TargetTransformInfo. Any plans on reusing that or following a similar approach in MLIR?

3.       As we know, finding the set of optimizations that yields the best performance is a tough problem, in part due to the dependencies and the impact on profitability/legality that may exist between them. Beyond having an independent cost model per optimization, or tweaking optimizations to make them aware of other optimizations, a more global engine could be interesting: an optimization planner that returns a set of optimizations to apply (VPlan tries to do that wrt vectorization in LLVM) and could take different evaluation strategies (different search approaches over the optimization space, greedier sub-optimal approaches, …). We are aware of the complexity of something like this, though. Any thoughts/plans in this regard?

4.       We are interested in exploiting parallelism, and OpenMP in particular. We’ve been following the discussion regarding the OpenMP dialect, but we are more interested in having a way to parallelize our ops when their semantics state that it is safe to do so. I wondered if having a generic way of describing parallelism in MLIR for this purpose would be interesting: something that we could later lower to the particular library or model we are interested in (OpenMP, TBB, pthreads, etc.).

 

Sorry, I feel I have more questions than any specific proposal, but hopefully this starts some interesting discussion :).

 

Thanks!

Diego

River Riddle

unread,
Jul 12, 2019, 7:49:37 PM7/12/19
to MLIR


On Friday, July 12, 2019 at 2:49:39 PM UTC-7, Caballero, Diego wrote:

Hi all,

Hi
 

 

We are about to start exploring some performance optimizations for nGraph in MLIR and we would like to know if there are already any ideas or plans in this regard.

Some comments/questions that come to mind are:

 

1.       Are there more optimizations under development? (for example, more loop optimizations based on the affine dialect?)

There are several optimizations under development, e.g. I'm currently working on adding initial inlining support and I believe there is work on generalizing the loop fusion pass to be more pluggable. One current proposal that I am working on is to add the ability for dialects/operations to provide interfaces to transformations/analyses/etc. We really want to promote generic passes that work on many different dialects when possible. 
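To make the interface idea concrete, here is a rough Python sketch of how a generic pass might consult a dialect-provided interface instead of hard-coding per-dialect logic. None of these names are real MLIR APIs; this only illustrates the dispatch pattern:

```python
# Hypothetical sketch (invented names, not the actual MLIR C++ API):
# dialects register interface implementations, and a generic pass talks
# to operations only through those interfaces.

class InlinerInterface:
    """Interface a dialect can implement to opt its ops into generic inlining."""
    def is_legal_to_inline(self, op):
        raise NotImplementedError

class MyDialectInliner(InlinerInterface):
    def is_legal_to_inline(self, op):
        # Dialect-specific legality rule: ops with side effects stay put.
        return not op.get("has_side_effects", False)

# Registry: (dialect name, interface kind) -> implementation.
INTERFACES = {("mydialect", InlinerInterface): MyDialectInliner()}

def generic_inline_pass(ops):
    """Dialect-agnostic pass: inlines only ops whose dialect provides an
    interface and whose interface says it is legal."""
    inlined = []
    for op in ops:
        iface = INTERFACES.get((op["dialect"], InlinerInterface))
        if iface and iface.is_legal_to_inline(op):
            inlined.append(op["name"])
    return inlined

ops = [
    {"dialect": "mydialect", "name": "call_a", "has_side_effects": False},
    {"dialect": "mydialect", "name": "call_b", "has_side_effects": True},
    {"dialect": "other", "name": "call_c"},  # no interface registered: skipped
]
result = generic_inline_pass(ops)  # -> ["call_a"]
```

The point of the pattern is that the pass never names a concrete dialect; new dialects participate by registering an interface, not by patching the pass.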
 

2.       We need to obtain the target hardware information to properly parametrize optimizations (cache sizes, vector length, number of registers, etc.). In LLVM we have TargetTransformInfo. Any plans on reusing that or following a similar approach in MLIR?

Given the many different levels of abstraction that we work under, what information is desired at what level, and how to express that to passes/pipelines, is still unknown. I don't know of any ongoing work to add a TTI equivalent, but this kind of falls into the generic pass infra I was talking about above.
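For illustration, a hypothetical TTI-like target description and one consumer of it could look something like this. The fields and the tiling heuristic are invented for the sketch, not an MLIR proposal:

```python
import math
from dataclasses import dataclass

@dataclass
class TargetInfo:
    # Rough analogue of the queries LLVM's TargetTransformInfo answers;
    # the field names here are made up for illustration.
    l1_cache_bytes: int
    vector_width_bits: int
    num_vector_registers: int

def pick_tile_size(target: TargetInfo, elem_bytes: int) -> int:
    """Toy heuristic: choose a square tile so ~3 tiles (A, B, C of a
    matmul) fit in L1, rounded to a multiple of the vector length."""
    budget = target.l1_cache_bytes // 3
    tile = math.isqrt(budget // elem_bytes)
    vlen = target.vector_width_bits // (8 * elem_bytes)
    return max(vlen, (tile // vlen) * vlen)

# Example parameters loosely resembling an AVX-512 core with 32 KiB L1.
avx512 = TargetInfo(l1_cache_bytes=32 * 1024, vector_width_bits=512,
                    num_vector_registers=32)
tile = pick_tile_size(avx512, elem_bytes=4)  # -> 48
```

The interesting design question in MLIR is where such a structure would live and which abstraction levels get to query which fields, as discussed above.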
 

3.       As we know, finding the set of optimizations that yields the best performance is a tough problem, in part due to the dependencies and the impact on profitability/legality that may exist between them. Beyond having an independent cost model per optimization, or tweaking optimizations to make them aware of other optimizations, a more global engine could be interesting: an optimization planner that returns a set of optimizations to apply (VPlan tries to do that wrt vectorization in LLVM) and could take different evaluation strategies (different search approaches over the optimization space, greedier sub-optimal approaches, …). We are aware of the complexity of something like this, though. Any thoughts/plans in this regard?

This is something that was discussed early on, but isn't something that is currently being worked on. The goal right now is to get the infrastructure to a place that makes this possible. I am interested in the idea though and have thought about it some (e.g. some kind of feedback loop from passes on what they did could be interesting). If you have any thoughts I'd love to hear them.
 

4.       We are interested in exploiting parallelism, and OpenMP in particular. We’ve been following the discussion regarding the OpenMP dialect, but we are more interested in having a way to parallelize our ops when their semantics state that it is safe to do so. I wondered if having a generic way of describing parallelism in MLIR for this purpose would be interesting: something that we could later lower to the particular library or model we are interested in (OpenMP, TBB, pthreads, etc.).

 

Sorry, I feel I have more questions than any specific proposal, but hopefully this starts some interesting discussion :).

The questions are good!

-- River
 

 

Thanks!

Diego

Alex Zinenko

unread,
Jul 15, 2019, 4:18:28 AM7/15/19
to MLIR
Hi Diego,

one clarification question: are you more interested in graph-level optimization or in more machine-level optimization?  Both are possible and will likely co-exist in MLIR.

On Fri, Jul 12, 2019 at 11:49 PM Caballero, Diego <diego.c...@intel.com> wrote:

Hi all,

 

We are about to start exploring some performance optimizations for nGraph in MLIR and we would like to know if there are already any ideas or plans in this regard.

Some comments/questions that come to mind are:

 

1.       Are there more optimizations under development? (for example, more loop optimizations based on the affine dialect?)


We are exploring different approaches to loop-level transformations, some of which are based on affine analysis and some others are more declarative and platform-specific (based on the Linalg dialect).
 

2.       We need to obtain the target hardware information to properly parametrize optimizations (cache sizes, vector length, number of registers, etc.). In LLVM we have TargetTransformInfo. Any plans on reusing that or following a similar approach in MLIR?


Eventually, we will have some target model.  I would expect it to be more flexible and multi-level, as everything else in MLIR.  For example, one may have a graph-level dialect with different operations for scalar and vectorized implementations of the same operator and use the target vectorization information, or one may want to know if a given target provides efficient libraries for specific operations so as to exit from MLIR early.
 

3.       As we know, finding the set of optimizations that yields the best performance is a tough problem, in part due to the dependencies and the impact on profitability/legality that may exist between them. Beyond having an independent cost model per optimization, or tweaking optimizations to make them aware of other optimizations, a more global engine could be interesting: an optimization planner that returns a set of optimizations to apply (VPlan tries to do that wrt vectorization in LLVM) and could take different evaluation strategies (different search approaches over the optimization space, greedier sub-optimal approaches, …). We are aware of the complexity of something like this, though. Any thoughts/plans in this regard?


Our general goal for optimizations is to expose a sufficient number of well-composable IR transformations that can be controlled by external mechanisms.  There is some work in that direction regarding loop transformations, where most operations are factored out into utility functions.  We expect to use IR concepts such as attributes or operation traits to capture information necessary for transformation validity, but making transformations composable in general is challenging and there is no magic solution.  It is not yet clear how polyhedral passes (meaning affine ILP-based scheduling) will be tied into a more general picture.  So far, the thinking is to provide multiple optimization flows and let that be another externally tweakable knob.
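As a toy illustration of what "externally controlled, composable transformations with attribute-based legality" could mean in practice, consider this sketch. Everything here is invented for the example; it only shows the control-flow shape:

```python
# Sketch: transformations exposed as small utility functions whose legality
# is recorded as attributes on the IR, so an external driver (a script, a
# search engine, ...) can compose them in any order. All names are made up.

def can_apply(loop, transform):
    # Legality lives on the IR object, not inside the pass.
    return transform in loop.get("legal_transforms", set())

def tile(loop, size):
    return dict(loop, tiled_by=size)

def unroll(loop, factor):
    return dict(loop, unrolled_by=factor)

TRANSFORMS = {"tile": lambda l: tile(l, 32), "unroll": lambda l: unroll(l, 4)}

def apply_plan(loop, plan):
    """External driver: applies a user-supplied sequence of transforms,
    skipping any that the attributes mark as illegal."""
    for name in plan:
        if can_apply(loop, name):
            loop = TRANSFORMS[name](loop)
    return loop

loop = {"name": "matmul_i", "legal_transforms": {"tile", "unroll"}}
result = apply_plan(loop, ["tile", "unroll"])
```

The externally tweakable knob Alex mentions would then be the `plan` argument: the same utilities serve a fixed pipeline, a polyhedral scheduler, or a search procedure.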
 

4.       We are interested in exploiting parallelism, and OpenMP in particular. We’ve been following the discussion regarding the OpenMP dialect, but we are more interested in having a way to parallelize our ops when their semantics state that it is safe to do so. I wondered if having a generic way of describing parallelism in MLIR for this purpose would be interesting: something that we could later lower to the particular library or model we are interested in (OpenMP, TBB, pthreads, etc.).


I am interested in separating the legality information from the execution strategy information.  In particular, marking something as "parallelizable" should not prescribe a specific parallel execution model too early.  However, preserving the "parallelizable" property throughout the transformations and lowerings isn't an easy task.  At the op implementation level, one thing we discussed was to provide some abstract control primitives, such as generic pointwise operations or generic tensor contractions, that could implicitly capture parallelism information along some dimensions without requiring us to run expensive analyses further down the pipeline.  This being said, it applies to parallelization of a specific operation.  Coarser-grain graph-level parallelization is also interesting for us, but likely requires a different representation for the execution strategy (but ideally the same for the legality).
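A minimal sketch of that legality/strategy separation, in plain Python rather than MLIR (the op encoding and all names are made up): the "parallelizable" fact lives on the op, and the lowering independently picks a threaded or a sequential strategy:

```python
from concurrent.futures import ThreadPoolExecutor

def pointwise_add(a, b, start, end, out):
    # The kernel itself is strategy-agnostic: it just covers a range.
    for i in range(start, end):
        out[i] = a[i] + b[i]

def lower(op, strategy, a, b):
    """Legality (op["parallel_dims"]) is separate from the execution
    strategy; any strategy must respect the same legality info."""
    out = [0] * len(a)
    if strategy == "threads" and op["parallel_dims"]:
        # Iterations along dim 0 are independent, so split the range.
        with ThreadPoolExecutor(max_workers=2) as pool:
            mid = len(a) // 2
            pool.submit(pointwise_add, a, b, 0, mid, out)
            pool.submit(pointwise_add, a, b, mid, len(a), out)
    else:
        # A sequential strategy is always a valid fallback.
        pointwise_add(a, b, 0, len(a), out)
    return out

add_op = {"name": "add", "parallel_dims": [0]}  # dim 0 is safe to parallelize
parallel = lower(add_op, "threads", [1, 2, 3, 4], [10, 20, 30, 40])
serial = lower(add_op, "sequential", [1, 2, 3, 4], [10, 20, 30, 40])
# Both strategies produce the same result: [11, 22, 33, 44]
```

Lowering to OpenMP, TBB, or pthreads would just be more choices of `strategy`, none of which needs to be baked into the op that carries the legality bit.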

Alex
 

 

Sorry, I feel I have more questions than any specific proposal, but hopefully this starts some interesting discussion :).

 

Thanks!

Diego

--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/025727872606CE4AA31D9EF6945969BA3CB25ABD%40ORSMSX115.amr.corp.intel.com.


--
-- Alex

Caballero, Diego

unread,
Jul 15, 2019, 3:14:38 PM7/15/19
to MLIR

On Friday, July 12, 2019 at 2:49:39 PM UTC-7, Caballero, Diego wrote:

Hi all,

Hi

 

We are about to start exploring some performance optimizations for nGraph in MLIR and we would like to know if there are already any ideas or plans in this regard.

Some comments/questions that come to mind are:

 

1.       Are there more optimizations under development? (for example, more loop optimizations based on the affine dialect?)

There are several optimizations under development, e.g. I'm currently working on adding initial inlining support and I believe there is work on generalizing the loop fusion pass to be more pluggable. One current proposal that I am working on is to add the ability for dialects/operations to provide interfaces to transformations/analyses/etc. We really want to promote generic passes that work on many different dialects when possible. 

 

Pretty exciting! Thanks for the heads-up. How generic passes are going to work on potentially any operation is in fact one of the most recurrent questions I’ve seen in internal discussions. Looking forward to learning more about this proposal.

 

2.       We need to obtain the target hardware information to properly parametrize optimizations (cache sizes, vector length, number of registers, etc.). In LLVM we have TargetTransformInfo. Any plans on reusing that or following a similar approach in MLIR?

Given the many different levels of abstraction that we work under, what information is desired at what level, and how to express that to passes/pipelines, is still unknown. I don't know of any ongoing work to add a TTI equivalent, but this kind of falls into the generic pass infra I was talking about above.

 

3.       As we know, finding the set of optimizations that yields the best performance is a tough problem, in part due to the dependencies and the impact on profitability/legality that may exist between them. Beyond having an independent cost model per optimization, or tweaking optimizations to make them aware of other optimizations, a more global engine could be interesting: an optimization planner that returns a set of optimizations to apply (VPlan tries to do that wrt vectorization in LLVM) and could take different evaluation strategies (different search approaches over the optimization space, greedier sub-optimal approaches, …). We are aware of the complexity of something like this, though. Any thoughts/plans in this regard?

This is something that was discussed early on, but isn't something that is currently being worked on. The goal right now is to get the infrastructure to a place that makes this possible. I am interested in the idea though and have thought about it some (e.g. some kind of feedback loop from passes on what they did could be interesting). If you have any thoughts I'd love to hear them.

 

It makes sense. The VPlan approach is more in the direction of creating multiple versions of the IR representing different optimization plans and picking the one with the best cost. This would require higher compilation time in some cases, in favor of a more accurate cost model and a better alignment between the cost that is being modeled and the instructions that are finally generated (this is currently an important issue in LLVM’s LV). I think a generalized version of this idea could also be interesting here and it would be more in line with the flexibility that MLIR pursues, i.e., having an engine that is able to return a profitable (hopefully optimal) list of optimizations to be applied given a strategy. This strategy could be something as simple as gathering analysis info (loop trip counts, mem access pattern, mem/computation ratio, etc.) and designing an optimization plan based on heuristics consuming that info, something more accurate/explicit (VPlan-like), an ML model :)…
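A toy version of that plan-selection loop, with a stand-in cost model (the candidate transforms and all the numbers are arbitrary, just to show the shape of the search):

```python
import itertools

def toy_cost(plan, trip_count):
    """Stand-in cost model: estimate cycles for a loop under a plan.
    The constants are made up for the example."""
    cost = float(trip_count)
    if "vectorize" in plan:
        cost /= 8            # pretend 8-wide vectors
    if "unroll" in plan:
        cost *= 0.9          # small scheduling benefit...
        cost += 4            # ...against a code-size/setup penalty
    return cost

def best_plan(transforms, trip_count):
    """Enumerate every subset of transforms (VPlan-style multi-versioning,
    but exhaustive) and keep the cheapest under the cost model."""
    candidates = []
    for r in range(len(transforms) + 1):
        candidates += [set(c) for c in itertools.combinations(transforms, r)]
    return min(candidates, key=lambda p: toy_cost(p, trip_count))

chosen = best_plan(["vectorize", "unroll"], trip_count=1000)
# -> {"vectorize", "unroll"} (cost 116.5 beats 125, 904, and 1000)
```

The "strategy" knob Diego mentions would replace the exhaustive enumeration: a greedy walk, a heuristic pruning of candidates, or a learned model proposing plans, all against the same cost interface.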

 

 

Thanks!

Diego

4.       We are interested in exploiting parallelism, and OpenMP in particular. We’ve been following the discussion regarding the OpenMP dialect, but we are more interested in having a way to parallelize our ops when their semantics state that it is safe to do so. I wondered if having a generic way of describing parallelism in MLIR for this purpose would be interesting: something that we could later lower to the particular library or model we are interested in (OpenMP, TBB, pthreads, etc.).

 

Sorry, I feel I have more questions than any specific proposal, but hopefully this starts some interesting discussion :).

The questions are good!

 

-- River

 

 

Thanks!

Diego


Caballero, Diego

unread,
Jul 15, 2019, 3:42:45 PM7/15/19
to MLIR

From: 'Alex Zinenko' via MLIR [mailto:ml...@tensorflow.org]

Hi Diego,

 

one clarification question: are you more interested in graph-level optimization or in more machine-level optimization?  Both are possible and will likely co-exist in MLIR.

 

At this particular point we are more interested in loop-level/machine-level optimizations, since graph-level optimizations happen in our nGraph IR for the different frameworks. We are exploring the level of performance we can get with MLIR/LLVM. We may consider having MLIR-based (ngraph dialect) graph-level optimizations in the future.

 

 

On Fri, Jul 12, 2019 at 11:49 PM Caballero, Diego <diego.c...@intel.com> wrote:

Hi all,

 

We are about to start exploring some performance optimizations for nGraph in MLIR and we would like to know if there are already any ideas or plans in this regard.

Some comments/questions that come to mind are:

 

1.       Are there more optimizations under development? (for example, more loop optimizations based on the affine dialect?)

 

We are exploring different approaches to loop-level transformations, some of which are based on affine analysis and some others are more declarative and platform-specific (based on the Linalg dialect).

 

Good to know, thanks! Please keep us posted on this. We didn’t have the Linalg dialect on the radar since we thought it was more of an educational dialect.

 

2.       We need to obtain the target hardware information to properly parametrize optimizations (cache sizes, vector length, number of registers, etc.). In LLVM we have TargetTransformInfo. Any plans on reusing that or following a similar approach in MLIR?

 

Eventually, we will have some target model.  I would expect it to be more flexible and multi-level, as everything else in MLIR.  For example, one may have a graph-level dialect with different operations for scalar and vectorized implementations of the same operator and use the target vectorization information, or one may want to know if a given target provides efficient libraries for specific operations so as to exit from MLIR early.

 

It sounds good. Again, please keep us posted :). Not a roadblock for now, though. I think we should be able to run some performance experiments without a target model for now.

 

3.       As we know, finding the set of optimizations that yields the best performance is a tough problem, in part due to the dependencies and the impact on profitability/legality that may exist between them. Beyond having an independent cost model per optimization, or tweaking optimizations to make them aware of other optimizations, a more global engine could be interesting: an optimization planner that returns a set of optimizations to apply (VPlan tries to do that wrt vectorization in LLVM) and could take different evaluation strategies (different search approaches over the optimization space, greedier sub-optimal approaches, …). We are aware of the complexity of something like this, though. Any thoughts/plans in this regard?

 

Our general goal for optimizations is to expose a sufficient number of well-composable IR transformations that can be controlled by external mechanisms.  There is some work in that direction regarding loop transformations, where most operations are factored out into utility functions.  We expect to use IR concepts such as attributes or operation traits to capture information necessary for transformation validity, but making transformations composable in general is challenging and there is no magic solution.  It is not yet clear how polyhedral passes (meaning affine ILP-based scheduling) will be tied into a more general picture.  So far, the thinking is to provide multiple optimization flows and let that be another externally tweakable knob.

 

Agreed. I think this aligns well with what I mentioned in River’s reply. I would personally be interested in this area if we are able to find common goals.

 

4.       We are interested in exploiting parallelism, and OpenMP in particular. We’ve been following the discussion regarding the OpenMP dialect, but we are more interested in having a way to parallelize our ops when their semantics state that it is safe to do so. I wondered if having a generic way of describing parallelism in MLIR for this purpose would be interesting: something that we could later lower to the particular library or model we are interested in (OpenMP, TBB, pthreads, etc.).

 

I am interested in separating the legality information from the execution strategy information.  In particular, marking something as "parallelizable" should not prescribe a specific parallel execution model too early.  However, preserving the "parallelizable" property throughout the transformations and lowerings isn't an easy task.  At the op implementation level, one thing we discussed was to provide some abstract control primitives, such as generic pointwise operations or generic tensor contractions, that could implicitly capture parallelism information along some dimensions without requiring us to run expensive analyses further down the pipeline.  This being said, it applies to parallelization of a specific operation.  Coarser-grain graph-level parallelization is also interesting for us, but likely requires a different representation for the execution strategy (but ideally the same for the legality).

 

Agreed. I think we would be interested in the former (parallelism along some dimensions of an op) for now. Would you have any design document or something sharable from those discussions?

Coarser-grain graph-level parallelism would be more like task parallelism, right? In addition to the graph-partitioning problem, there would be some work to do wrt task scheduling based on dependences between tasks (prioritizing tasks that enable more parallelism, the critical path in the execution graph, etc.). That’s an interesting problem.
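For instance, the critical-path priority mentioned above could be computed along these lines (the task names, durations, and dependence graph are made up for the example):

```python
def critical_path(tasks, deps):
    """tasks[t] = duration of t; deps[t] = tasks that must finish before t.
    Returns, for each task, the length of the longest dependence chain
    starting at it: a natural scheduling priority."""
    successors = {t: [] for t in tasks}
    for t, preds in deps.items():
        for p in preds:
            successors[p].append(t)
    memo = {}
    def length(t):
        if t not in memo:
            memo[t] = tasks[t] + max((length(s) for s in successors[t]),
                                     default=0)
        return memo[t]
    return {t: length(t) for t in tasks}

# Tiny op graph: conv feeds both a bias->relu chain and a pooling branch.
tasks = {"conv": 5, "bias": 1, "relu": 1, "pool": 2}
deps = {"bias": ["conv"], "relu": ["bias"], "pool": ["conv"]}

priority = critical_path(tasks, deps)  # conv: 7, bias: 2, pool: 2, relu: 1
# Schedule higher-priority tasks first: tasks on the long chain outrank
# the short branch, which tends to expose parallelism earlier.
order = sorted(tasks, key=lambda t: -priority[t])
```

A real scheduler would additionally respect readiness (a task runs only once its predecessors finish) and resource limits, but the priority computation is the core of the critical-path heuristic.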

 

Thanks!

Diego

 

Alex

 

 

Sorry, I feel I have more questions than any specific proposal, but hopefully this starts some interesting discussion :).

 

Thanks!

Diego



 

--

-- Alex

