Max Op in Standard

Geoffrey Martin-Noble

unread,
May 10, 2019, 4:39:54 PM5/10/19
to ml...@tensorflow.org
We're lowering from the XLA dialect (not open source yet) to use our runtime that has a set of max operations. This felt like one of those things that should be in standard. I noticed that the documentation for the standard select operation mentions that max can be implemented as cmp + select, which makes sense, but isn't ideal if our backend has a max op. Since MLIR is trying to avoid raising operations, this feels like a premature lowering.
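For concreteness, the cmp + select expansion the select documentation refers to looks roughly like this in the standard dialect (syntax sketched from memory, so it may not be exact):

  %cond = cmpf "ogt", %a, %b : f32
  %max  = select %cond, %a, %b : f32

That is two ops, plus a choice of predicate ("ogt" vs. "ugt") that pins down the NaN behavior, where our runtime could consume a single max.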

What are people's thoughts on having a first-class set of max operations in std ops?

Chris Lattner

unread,
May 10, 2019, 4:53:12 PM5/10/19
to Geoffrey Martin-Noble, ml...@tensorflow.org
On May 10, 2019, at 1:39 PM, 'Geoffrey Martin-Noble' via MLIR <ml...@tensorflow.org> wrote:

We're lowering from the XLA dialect (not open source yet) to use our runtime that has a set of max operations. This felt like one of those things that should be in standard. I noticed that the documentation for the standard select operation mentions that max can be implemented as cmp + select, which makes sense, but isn't ideal if our backend has a max op. Since MLIR is trying to avoid raising operations, this feels like a premature lowering.

What are people's thoughts on having a first-class set of max operations in std ops?

Hi Geoffrey,

Standard ops is still rapidly evolving, and isn’t really well designed or defined yet.  As we get further along, I’d love for us to do a more detailed survey of the prior art in ONNX, nGraph core, and other communities to help define and design it.  Lots of smart people have thought about this problem.


That said, there are some principles that are likely to inform the design:  When standard ops exists, our goal isn’t to “avoid lowering” in it - such a goal could only be achieved by ’standardizing’ all of the ops that all frontends have, which isn’t practical.

Consider something like relu: it is cleanly lowerable to max, which is cleanly lowerable to cmp/select.

My view is that we shouldn’t have relu (or max) as standard ops because of that “cleanly” lowerable aspect: once lowered, it is very simple for backends to pattern match max or relu from the primitive operations.  This is why we’re investing in powerful graph pattern matching infrastructure, to make this easy to do.
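As a sketch of what that looks like in IR (syntax approximate): relu(%x), once lowered through max(%x, 0), ends up as

  %zero = constant 0.0 : f32
  %cmp  = cmpf "ogt", %x, %zero : f32
  %relu = select %cmp, %x, %zero : f32

and a backend with a native max or relu just pattern matches the cmpf/select pair (plus the zero constant) back to its own operation.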

When defining standard ops, our goal will be to find a balance between having completely primitive operations and having the few larger-grained operations (e.g. conv) that are critically important to have.

I suspect that there will be an unending list of higher level operations that various backends will want - some will want to have special support for very complex operations like fused batch norm. My view is that the backend will be able to declare any ops they want as supported (including frontend specific ops like tf.FusedBatchNorm) and if the backend supports it, then those ops will not be lowered.  If a backend does not support it, then the compiler will apply a standard series of expansions to produce the finer grained “standard” ops.
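To make that concrete with the max example from this thread (the "mybackend" dialect and its ops below are purely illustrative, not anything that exists):

  // Backend that declares a native max as supported: the coarse op survives lowering untouched.
  %max = "mybackend.max"(%a, %b) : (f32, f32) -> f32

  // Backend without a native max: the same computation is expanded into the finer grained standard ops.
  %cond = cmpf "ogt", %a, %b : f32
  %max2 = select %cond, %a, %b : f32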

I expect that this will provide a sweet spot where a backend can choose to just implement the fine grained atoms if they want, but they can also choose to implement completely custom high level ops if there is a need or desire to do so.

This infra is still extremely early, but we’ve built it before in the LLVM instruction selection framework for the scalar domain, so I have pretty high confidence that it will come together very nicely.

Does this make sense?

tl;dr: I’d prefer to *not* have a max op :-)

-Chris

Alex Zinenko

unread,
May 10, 2019, 5:59:13 PM5/10/19
to Chris Lattner, Geoffrey Martin-Noble, MLIR
I may go slightly off-topic, but since the remark about max in the rationale is mine, consider it an extended justification for the current state of things.

Despite several attempts, we never agreed on what the "standard" dialect should be or contain in general. Personally, I think it should not be called "standard" or be treated as special in any way relative to other dialects. I see it more as a set of core dialects containing, e.g., scalar/vector operations in one, memref management in another, and so on. In such a scalar/vector context, inspired by LLVM, using cmpi/cmpf followed by select looks reasonable. That being said, we don't want to write std.addf instead of just addf, which is unambiguous enough.
The alternative view was for the "standard" dialect to contain any op that is not front- or backend-specific, including MLIR's own understanding of common ML layers. Hence the semantics that std.add and co. have on tensors as pointwise operations. For those, I would prefer to say explicitly that they are pointwise, either through a dialect prefix or through a higher-level operation.

With all this context in mind, what are the semantics of the min/max you want to add? Are these tensor-level operations that apply pointwise? Do you also want min/max reductions along some tensor dimensions? If so, it may make sense to think about a tensor-level core dialect that would sit in between tensor frameworks and middle-level abstractions like yours, and that we know how to lower to the core scalar dialect.

The practical reason against first-class min/max at the moment of writing was that we would need mini/maxi with a sign attribute and minf/maxf with an order attribute, where those attributes partially duplicate those of cmpi and cmpf, which we need anyway, along with a single simple 'select' operation.
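For example (approximate syntax), signed and unsigned integer max differ only in the cmpi predicate:

  %sgt  = cmpi "sgt", %a, %b : i32
  %smax = select %sgt, %a, %b : i32

  %ugt  = cmpi "ugt", %a, %b : i32
  %umax = select %ugt, %a, %b : i32

A standalone maxi would have to carry that signedness as its own attribute (and minf/maxf an ordered/unordered attribute), duplicating exactly what cmpi/cmpf already encode.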

Alex


Sean Silva

unread,
May 10, 2019, 6:40:50 PM5/10/19
to MLIR
I generally agree, but inline I have given one caveat that people should be aware of.


On Friday, May 10, 2019 at 1:53:12 PM UTC-7, Chris Lattner wrote:
Consider something like relu: it is cleanly lowerable to max, which is cleanly lowerable to cmp/select.

My view is that we shouldn’t have relu (or max) as standard ops because of that “cleanly” lowerable aspect: once lowered, it is very simple for backends to pattern match max or relu from the primitive operations.  This is why we’re investing in powerful graph pattern matching infrastructure, to make this easy to do.

One caveat is that this can be broken by transformations mangling your pattern. One failure mode I remember from working on an LLVM backend on my previous project is that sometimes, when we merged from upstream, some new InstCombine transformation would break a pattern we were relying on during lowering, causing ISel failures for us.

One case I remember was related to LLVM's "CreateVectorSplat" pattern (insertelement + shufflevector): InstCombine commuted some operations with the shufflevector and would break the vector splat pattern.

-- Sean Silva

Caballero, Diego

unread,
May 10, 2019, 7:30:07 PM5/10/19
to Chris Lattner, Geoffrey Martin-Noble, MLIR

Interesting discussion :)

 

From the LLVM perspective, and the vectorizer in particular, we investigated a similar problem with vector idioms in the vectorizer (optimizer) and how to retrieve or preserve that information all the way down to the backend. Even though we are not talking about the same level of abstraction, maybe that experience might be useful here in some way.

 

We found that for very simple idioms, like min/max/abs, it was feasible to use an LLVM-IR canonical form for them (cmp + select). However, for idioms just a bit more complex (5-10 instructions), a canonical form was infeasible due to the high number of variants we could have for the same idiom and the difficulty of preserving them intact until the backend. For those, intrinsics were the suggested way to go.

 

> My view is that we shouldn’t have relu (or max) as standard ops because of that “cleanly” lowerable aspect: once lowered, it is very simple for backends to pattern match max or relu from the primitive operations.  This is why we’re investing in powerful graph pattern matching infrastructure, to make this easy to do.

 

This makes sense to me. For the reasons I mentioned before, though, we might need to evaluate this approach case by case, to make sure that we will be able to pattern match the high-level op after the lowering. For complex ops that we can pattern match today, we may also want to consider the impact of assuming that the pattern match will always succeed. That may impose some constraints on future optimizations that could change the expected patterns: we would have to make them aware of the patterns, which might not be ideal.

 

> My view is that the backend will be able to declare any ops they want as supported (including frontend specific ops like tf.FusedBatchNorm) and if the backend supports it, then those ops will not be lowered.  If a backend does not support it, then the compiler will apply a standard series of expansions to produce the finer grained “standard” ops.

 

This makes a lot of sense to me. Something to consider here is how keeping an unknown high-level op might prevent other optimizations implemented in the standard dialect. We had the same questions wrt using intrinsics for vector idioms.

 

Thanks!

 

Diego Caballero

nGraph

Caballero, Diego

unread,
May 10, 2019, 7:35:49 PM5/10/19
to Sean Silva, MLIR

Yeah, exactly.

 

Diego

 


I generally agree, but inline I have given one caveat that people should be aware of.


Stella Laurenzo

unread,
May 10, 2019, 8:33:34 PM5/10/19
to Chris Lattner, River Riddle, Geoffrey Martin-Noble, MLIR
On Fri, May 10, 2019 at 1:53 PM, 'Chris Lattner' via MLIR <ml...@tensorflow.org> wrote:

This makes sense to me. Internally, I've found myself explaining this 1:1 several times, and I think people would benefit from having a FAQ or design doc (if not an initial implementation) to reason about soon. I'd like to start putting some practical pressure on it soon to see how it works. I think +River Riddle has this on his radar, but we don't have an issue for it yet.
 

Does this make sense?

tl;dr: I’d prefer to *not* have a max op :-)

Same. Or a relu, or a relu6, or a relun1.
 

-Chris


Hongbin Zheng

unread,
May 10, 2019, 9:02:55 PM5/10/19
to Chris Lattner, Geoffrey Martin-Noble, ml...@tensorflow.org
On Fri, May 10, 2019 at 1:53 PM 'Chris Lattner' via MLIR
<ml...@tensorflow.org> wrote:
>
> On May 10, 2019, at 1:39 PM, 'Geoffrey Martin-Noble' via MLIR <ml...@tensorflow.org> wrote:
>
>
> We're lowering from the XLA dialect (not open source yet) to use our runtime that has a set of max operations. This felt like one of those things that should be in standard. I noticed that the documentation for the standard select operation mentions that max can be implemented as cmp + select, which makes sense, but isn't ideal if our backend has a max op. Since MLIR is trying to avoid raising operations, this feels like a premature lowering.
>
> What are people's thoughts on having a first-class set of max operations in std ops?

>
>
> Hi Geoffrey,
>
> Standard ops is still rapidly evolving, and isn’t really well designed or defined yet. As we get further along, I’d love for us to do a more detailed survey of the prior art in ONNX, nGraph core, and other communities to help define and design it. Lots of smart people have thought about this problem.

I wonder if we could define a set of "meta" standard ops built into the 'M(achine)L(earning)' part of MLIR (rough strawman sketched below):
1. Load/Store (memory accesses)
2. elementwise ops (which include broadcast)
3. reduction ops
4. tensor contraction
5. others
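Purely as a strawman, in made-up generic syntax (neither the 'meta' dialect nor these ops exist):

  // elementwise max with broadcast of the smaller operand
  %0 = "meta.max"(%lhs, %rhs) : (tensor<4x8xf32>, tensor<8xf32>) -> tensor<4x8xf32>
  // max-reduction along dimension 1
  %1 = "meta.reduce_max"(%0) {dims = [1]} : (tensor<4x8xf32>) -> tensor<4xf32>
  // tensor contraction
  %2 = "meta.contract"(%0, %w) : (tensor<4x8xf32>, tensor<8x16xf32>) -> tensor<4x16xf32>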

Thanks
Hongbin


Chris Lattner

unread,
May 10, 2019, 11:35:37 PM5/10/19
to Sean Silva, MLIR
On May 10, 2019, at 3:40 PM, 'Sean Silva' via MLIR <ml...@tensorflow.org> wrote:

Consider something like relu: it is cleanly lowerable to max, which is cleanly lowerable to cmp/select.

My view is that we shouldn’t have relu (or max) as standard ops because of that “cleanly” lowerable aspect: once lowered, it is very simple for backends to pattern match max or relu from the primitive operations.  This is why we’re investing in powerful graph pattern matching infrastructure, to make this easy to do.

One caveat is that this can be broken by transformations mangling your pattern. One failure mode I remember from working on an LLVM backend on my previous project is that sometimes, when we merged from upstream, some new InstCombine transformation would break a pattern we were relying on during lowering, causing ISel failures for us.

One case I remember was related to LLVM's "CreateVectorSplat" pattern (insertelement + shufflevector): InstCombine commuted some operations with the shufflevector and would break the vector splat pattern.

Right, this is a good point.  Part of the issue here is that there are two phases of lowering going on: the case you mention is a lowering that is done because LLVM can’t reliably express the notion of a vector splat, and so it has to be expanded when the vectorizer produces it - MLIR doesn’t have this problem.

I’m only talking about the case of “legalize” in LLVM, which is responsible for mapping illegal ops onto legal ones (through lots of different sorts of transformations).  So long as we preserve a well defined lattice of lowering, we should be ok here.

-Chris

Chris Lattner

unread,
May 10, 2019, 11:41:29 PM5/10/19
to Stella Laurenzo, River Riddle, Geoffrey Martin-Noble, MLIR


On May 10, 2019, at 5:32 PM, Stella Laurenzo <laur...@google.com> wrote:

I suspect that there will be an unending list of higher level operations that various backends will want - some will want to have special support for very complex operations like fused batch norm. My view is that the backend will be able to declare any ops they want as supported (including frontend specific ops like tf.FusedBatchNorm) and if the backend supports it, then those ops will not be lowered.  If a backend does not support it, then the compiler will apply a standard series of expansions to produce the finer grained “standard” ops.

I expect that this will provide a sweet spot where a backend can choose to just implement the fine grained atoms if they want, but they can also choose to implement completely custom high level ops if there is a need or desire to do so.

This infra is still extremely early, but we’ve built it before in the LLVM instruction selection framework for the scalar domain, so I have pretty high confidence that it will come together very nicely.

This makes sense to me. Internally, I've found myself explaining this 1:1 several times, and I think people would benefit from having a FAQ or design doc (if not an initial implementation) to reason about soon. I'd like to start putting some practical pressure on it soon to see how it works. I think +River Riddle has this on his radar, but we don't have an issue for it yet.

I’m hoping to get a chance to work with River on this infra in the near future.

-Chris

Chris Lattner

unread,
May 10, 2019, 11:45:41 PM5/10/19
to Caballero, Diego, Geoffrey Martin-Noble, MLIR
On May 10, 2019, at 4:30 PM, Caballero, Diego <diego.c...@intel.com> wrote: 
From the LLVM perspective, and the vectorizer in particular, we investigated a similar problem with vector idioms in the vectorizer (optimizer) and how to retrieve or preserve that information all the way down to the backend. Even though we are not talking about the same level of abstraction, maybe that experience might be useful here in some way.
 
We found that for very simple idioms, like min/max/abs, it was feasible to use an LLVM-IR canonical form for them (cmp + select). However, for idioms just a bit more complex (5-10 instructions), a canonical form was infeasible due to the high number of variants we could have for the same idiom and the difficulty of preserving them intact until the backend. For those, intrinsics were the suggested way to go.

Right.  Note that the vectorizer is a good example of a "level raising" transformation: it uses static analysis and dynamic checks to infer properties of the computation and transform it into coarser-grained operations.  Lowering (transforming into finer-grained operations) is a bit different.

Here the issue is that LLVM doesn't have many target-independent intrinsics, and it makes defining new abstractions like this generally more difficult than it needs to be.  MLIR has the opposite problem, as you note below:

> My view is that the backend will be able to declare any ops they want as supported (including frontend specific ops like tf.FusedBatchNorm) and if the backend supports it, then those ops will not be lowered.  If a backend does not support it, then the compiler will apply a standard series of expansions to produce the finer grained “standard” ops.
 
This makes a lot of sense to me. Something to consider here is how keeping an unknown high-level op might prevent other optimizations implemented in the standard dialect. We had the same questions wrt using intrinsics for vector idioms.

This is exactly the problem with making it too easy to define new abstractions: if you pattern match to a new high level operation, optimizations that apply to the individual operations won’t apply to the aggregate.

I suspect there is no magic solution to this, it just means that this aspect of compiler design will still remain an empirical art.  Full employment for compiler developers :-)

-Chris


Geoffrey Martin-Noble

unread,
May 13, 2019, 1:59:56 PM5/13/19
to Chris Lattner, Caballero, Diego, MLIR
Thanks all for the discussion :-)

I think your rationale makes sense, Chris. I guess I had gotten the impression from previous discussions that MLIR was taking a much harder line against raising than the one you are advocating here.

I also agree with Alex that these kinds of discussions would be simpler if we had a clearly defined vision for "standard", perhaps accompanied by a name other than "standard". Even multiple "standard" dialects might make sense. It seems like there's room for a tensor dialect, somewhat similar to XLA, but not bound to XLA semantics (which I think tries to jam too much behavior into a single op).

For now, I just lowered directly from XLA max, but in the future we may implement a peephole optimization that turns cmp+select back into max, at which point it might make sense to also route this via standard ops. That kind of logic does seem common enough that it need not be backend-specific though, which again suggests to me a common tensor dialect.

Chris Lattner

unread,
May 14, 2019, 1:25:50 AM5/14/19
to Geoffrey Martin-Noble, Caballero, Diego, MLIR
Makes sense; we will definitely continue to re-evaluate this over time, and I agree that "standard" isn't really the right name for this.  Thanks for bringing this up, Geoffrey!

-Chris

Sana Damani

unread,
Jun 10, 2019, 2:53:51 PM6/10/19
to MLIR
For now, I just lowered directly from XLA max

Hi Geoffrey,

This is an interesting idea: lower high-level ops directly to equivalent back-end ops if the architecture supports them. This certainly avoids the problem of lowering to fine-grained ops and then trying, and potentially failing, to rediscover the pattern.

Do you see any issues with this approach in general? Given that your approach now involves multiple dialects (standard + your low-level dialect), can they still participate in optimizations together? For example, if your back-end supports max but also supports some fused op max+other_op, and you lower max to a different dialect from other_op, would this fusion then be prevented?
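As a concrete (entirely made-up) instance of the question, with addf standing in for other_op and "mybackend" for your dialect:

  // max already lowered into the backend dialect, the addf still in standard:
  %m = "mybackend.max"(%a, %b) : (f32, f32) -> f32
  %r = addf %m, %c : f32

  // versus the fused op the backend would presumably like to form:
  %r2 = "mybackend.max_add"(%a, %b, %c) : (f32, f32, f32) -> f32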

Sana

Geoffrey Martin-Noble

unread,
Jun 10, 2019, 6:06:07 PM6/10/19
to Sana Damani, MLIR
I suspect it's going to be a tradeoff. In the general case, I'm guessing that limiting the number of lowerings you have to write is better; as long as you're guaranteed to get some correct pattern, it's probably fine. In particularly performance-sensitive cases you could write your own special lowering that bypasses the common path. The cost is something like what you mentioned: you may miss out on other optimizations that work on the standard level, like the fusion you described, or you end up having to rewrite that fusion, which means extra work on both ends. It's nice to be able to take advantage of as much of the core infrastructure as possible.

I believe the improvements River is making to the lowering framework will make all this much easier.

In this particular case, my primary goal was to get some models onto our backend via MLIR so we can start iterating, so I basically just chose the thing that was easiest to implement and not horribly wrong :-D


Sana Damani

unread,
Jun 10, 2019, 6:22:07 PM6/10/19
to MLIR
Thank you for your response Geoffrey.
 
I believe the improvements River is making to the lowering framework will make all this much easier.
Do you know where I can read more about these improvements?
 
 It's nice to be able to take advantage of as much of the core infrastructure as possible.
I agree. But this would require the back-end specific ops to be part of the standard dialect. Do you know why this might be a problem? And do you think being able to inherit from the standard dialect would help (I have created a separate thread on sub-dialects)? This would allow back-ends to add supported ops to specialized versions of the standard dialect but still have access to the general ops and optimizations that belong to the standard dialect.

the backend will be able to declare any ops they want as supported (including frontend specific ops like tf.FusedBatchNorm) and if the backend supports it, then those ops will not be lowered.  If a backend does not support it, then the compiler will apply a standard series of expansions to produce the finer grained “standard” ops.
I believe this approach would also result in mixed dialects (high level op + standard ops instead of low level op + standard ops) thereby preventing optimizations.

Sana

Geoffrey Martin-Noble

unread,
Jun 10, 2019, 7:11:03 PM6/10/19
to Sana Damani, River Riddle, MLIR
On Mon, Jun 10, 2019 at 3:22 PM Sana Damani <sana....@intel.com> wrote:
Thank you for your response Geoffrey.
 
I believe the improvements River is making to the lowering framework will make all this much easier.
Do you know where I can read more about these improvements?
+River. I think there isn't a public doc at the moment, but implementation has already started. Basically, the idea is to treat lowerings as a directed graph: you declare your target ops and the available transformations, and the framework lowers to something legal for the target (if possible), using a heuristic cost model to choose among paths.
 
 It's nice to be able to take advantage of as much of the core infrastructure as possible.
I agree. But this would require the back-end specific ops to be part of the standard dialect. Do you know why this might be a problem? And do you think being able to inherit from the standard dialect would help (I have created a separate thread on sub-dialects)? This would allow back-ends to add supported ops to specialized versions of the standard dialect but still have access to the general ops and optimizations that belong to the standard dialect.
I don't think that follows, or maybe I don't understand what you mean by "part of" the standard dialect. You can lower to standard, perform optimization transformations there, and then lower to your backend dialect. I think we want to keep standard/core as stuff that's broadly usable across dialects, not specific to some backend or backend dialect. Maybe what you're describing is defining an op outside of standard that is a subset of some standard op, e.g. addi but only for i16. Additional type constraints as part of lowerings are something we're looking at, but not as a separate op. Inheritance doesn't seem like the right paradigm for that case, in particular since it breaks Liskov substitution by adding constraints; I think we might get into some gross illegal cases (a lowering could produce invalid IR). We have previously talked about defining attributes on ops that indicate their suitability for some purpose, which may be more appropriate.

the backend will be able to declare any ops they want as supported (including frontend specific ops like tf.FusedBatchNorm) and if the backend supports it, then those ops will not be lowered.  If a backend does not support it, then the compiler will apply a standard series of expansions to produce the finer grained “standard” ops.
I believe this approach would also result in mixed dialects (high level op + standard ops instead of low level op + standard ops) thereby preventing optimizations.
Not sure who said this originally, but yes, this would be a conscious decision to lose out on optimizations in core in favor of a direct lowering path that may be known to be better. This corresponds to putting the known op in the target set described above. Alternatively, the backend dialect could define its own op that can be lowered to directly from some frontend op, and define that lowering.