[LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?

Chandler Carruth

unread,

Oct 13, 2014, 9:02:52 PM10/13/14

to LLVM Developers Mailing List, James Molloy, Zinovy Nis, Andy Trick, Arnold Schwaighofer, Hal Finkel

I've added a straw-man of some extra optimization passes that help specific benchmarks here or there by either preparing code better on the way into the vectorizer or cleaning up afterward. These are off by default until there is some consensus on the right path forward, but this way we can all test out the same set of flags, and collaborate on any tweaks to them.

The primary principle here is that the vectorizer expects the IR input to be in a certain canonical form, and produces IR output that may not yet be in that form. The primary alternative to this is to make the vectorizers both extra powerful (able to recognize many variations on things like loop structure) and extra cautious about their emitted code (so that it is always already optimized). I much prefer the solution of using passes rather than this unless compile time is hurt too drastically. It makes it much easier to test, validate, and compose all of the various components of the core optimizer.

Here is the structural diff:

+ loop-rotate

loop-vectorize

+ early-cse

+ correlated-propagation

+ instcombine

+ licm

+ loop-unswitch

+ simplifycfg

+ instcombine

slp-vectorize

+ early-cse

The rationale I have for this:

1) Zinovy pointed out that the loop vectorizer really needs the input loops to still be rotated. One counter point is that perhaps we should prevent any pass from un-rotating loops?

2) I cherrypicked the core of the scalar optimization pipeline that seems like it would be relevant to code which looks like runtime checks. Things like correlated values for overlap predicates, loop invariant code, or predicates that can be unswitched out of loops. Then I added the canonicalizing passes that might be relevant given those passes.

3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it isn't relevant for SLP vectorize, no idea. I did say this was a straw man. =D

My benchmarking has shown some modest improvements to benchmarks, but nothing huge. However, it shows only a 2% slowdown for building the 'opt' binary, which I'm actually happy with so that we can work to improve the loop vectorizer's overhead *knowing* that these passes will clean up stuff. Thoughts? I'm currently OK with this, but it's pretty borderline so I just wanted to start the discussion and see what other folks observe in their benchmarking.

-Chandler

Arnold Schwaighofer

unread,

Oct 14, 2014, 11:56:36 AM10/14/14

to Chandler Carruth, LLVM Developers Mailing List, James Molloy

> On Oct 13, 2014, at 5:56 PM, Chandler Carruth <chan...@gmail.com> wrote:
>
> I've added a straw-man of some extra optimization passes that help specific benchmarks here or there by either preparing code better on the way into the vectorizer or cleaning up afterward. These are off by default until there is some consensus on the right path forward, but this way we can all test out the same set of flags, and collaborate on any tweaks to them.
>
> The primary principle here is that the vectorizer expects the IR input to be in a certain canonical form, and produces IR output that may not yet be in that form. The primary alternative to this is to make the vectorizers both extra powerful (able to recognize many variations on things like loop structure) and extra cautious about their emitted code (so that it is always already optimized). I much prefer the solution of using passes rather than this unless compile time is hurt too drastically. It makes it much easier to test, validate, and compose all of the various components of the core optimizer.
>
> Here is the structural diff:
>
> + loop-rotate
> loop-vectorize
> + early-cse
> + correlated-propagation
> + instcombine
> + licm
> + loop-unswitch
> + simplifycfg
> + instcombine
> slp-vectorize
> + early-cse
>

I think a late loop optimization (vectorization) pipeline makes sense. I think we just have to carefully evaluate benefit over compile time.

Runing loop rotation makes sense. Critical edge splitting can transform loops into a form that prevents loop vectorization.

Both the loop vectorizer and the SLPVectorizer perform limited (restricted in region) forms of CSE to cleanup. EarlyCSE runs across the whole function and so might catch more opportunities.

The downside of always running passes is that we pay the cost irrespective of benefit. There might not be much to cleanup if we don’t vectorize a loop but we still have to pay for running the cleanup passes. This has been the motivator to have “pass local” CSE but this also stems from a time where we ran within the inlining pass manager which meant running over and over again.

I think we will just have to look at compile time and decide what makes sense.

> The rationale I have for this:
>
> 1) Zinovy pointed out that the loop vectorizer really needs the input loops to still be rotated. One counter point is that perhaps we should prevent any pass from un-rotating loops?
>
> 2) I cherrypicked the core of the scalar optimization pipeline that seems like it would be relevant to code which looks like runtime checks. Things like correlated values for overlap predicates, loop invariant code, or predicates that can be unswitched out of loops. Then I added the canonicalizing passes that might be relevant given those passes.
>
> 3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it isn't relevant for SLP vectorize, no idea. I did say this was a straw man. =D
>
>
> My benchmarking has shown some modest improvements to benchmarks, but nothing huge. However, it shows only a 2% slowdown for building the 'opt' binary, which I'm actually happy with so that we can work to improve the loop vectorizer's overhead *knowing* that these passes will clean up stuff. Thoughts? I'm currently OK with this, but it's pretty borderline so I just wanted to start the discussion and see what other folks observe in their benchmarking.
>
> -Chandler

_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Hal Finkel

unread,

Oct 14, 2014, 12:04:42 PM10/14/14

to Arnold Schwaighofer, LLVM Developers Mailing List, James Molloy

In my experience, running a late EarlyCSE produces generic speedups across the board.

-Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Andrew Trick

unread,

Oct 14, 2014, 1:16:11 PM10/14/14

to Arnold Schwaighofer, LLVM Developers Mailing List, James Molloy

> On Oct 14, 2014, at 8:53 AM, Arnold Schwaighofer <aschwa...@apple.com> wrote:
>
>
>> On Oct 13, 2014, at 5:56 PM, Chandler Carruth <chan...@gmail.com> wrote:
>>
>> I've added a straw-man of some extra optimization passes that help specific benchmarks here or there by either preparing code better on the way into the vectorizer or cleaning up afterward. These are off by default until there is some consensus on the right path forward, but this way we can all test out the same set of flags, and collaborate on any tweaks to them.
>>
>> The primary principle here is that the vectorizer expects the IR input to be in a certain canonical form, and produces IR output that may not yet be in that form. The primary alternative to this is to make the vectorizers both extra powerful (able to recognize many variations on things like loop structure) and extra cautious about their emitted code (so that it is always already optimized). I much prefer the solution of using passes rather than this unless compile time is hurt too drastically. It makes it much easier to test, validate, and compose all of the various components of the core optimizer.
>>
>> Here is the structural diff:
>>
>> + loop-rotate
>> loop-vectorize
>> + early-cse
>> + correlated-propagation
>> + instcombine
>> + licm
>> + loop-unswitch
>> + simplifycfg
>> + instcombine
>> slp-vectorize
>> + early-cse
>>
>
> I think a late loop optimization (vectorization) pipeline makes sense. I think we just have to carefully evaluate benefit over compile time.
>
> Runing loop rotation makes sense. Critical edge splitting can transform loops into a form that prevents loop vectorization.
>
> Both the loop vectorizer and the SLPVectorizer perform limited (restricted in region) forms of CSE to cleanup. EarlyCSE runs across the whole function and so might catch more opportunities.
>
> The downside of always running passes is that we pay the cost irrespective of benefit. There might not be much to cleanup if we don’t vectorize a loop but we still have to pay for running the cleanup passes. This has been the motivator to have “pass local” CSE but this also stems from a time where we ran within the inlining pass manager which meant running over and over again.
>
> I think we will just have to look at compile time and decide what makes sense.

It’s great that we’re running the vectorizers late, outside CGSCC. Regarding the set of passes that we rerun, I completely agree with Arnold. Naturally, iterating over the pass pipeline produces speedups, and I understand the engineering advantage. But rerunning several expensive function passes on the slim chance that a loop was transformed is an awful design for compile time.

>> + loop-rotate

I have no concern about loop-rotate. It should be very fast.

>> loop-vectorize
>> + early-cse

Passes like loop-vectorize should be able to do their own CSE without much engineering effort.

>> + correlated-propagation

A little worried about this.

>> + instcombine

I'm *very* concerned about rerunning instcombine, but understand it may help cleanup the vectorized preheader.

>> + licm
>> + loop-unswitch

These should limited to the relevant loop nest.

>> + simplifycfg

OK if the CFG actually changed.

>> + instcombine

instcombine again! This can’t be good.

>> slp-vectorize
>> + early-cse

SLP should do its own CSE.

—

I think it’s generally useful to have an “extreme” level of optimization without much regard for compile time, and in that scenario this pipeline makes sense. But this is hardly something that should happen at -O2/-Os, unless real data shows otherwise.

If the pass manager were designed to conditionally invoke late passes triggered by certain transformation passes, that would solve my immediate concern.

Long term, I think a much better design is for function transformations to be conditionally rerun within a scope/region. For example, loop-vectorize should be able to trigger instcombine on the loop preheader, which I think is the real problem here.

-Andy

Andrew Trick

unread,

Oct 14, 2014, 1:24:56 PM10/14/14

to Arnold Schwaighofer, LLVM Developers Mailing List, James Molloy

One more thing I forgot to mention. I think it makes a lot of sense to have an early canonical instcombine mode to expose opportunities for simplifyCFG and loop passes and a late mode that optimizes for code gen. It’s possible that some expensive logic does not need to be repeatedly applied, although I don’t have evidence of that.

Hal Finkel

unread,

Oct 14, 2014, 1:34:55 PM10/14/14

to Andrew Trick, LLVM Developers Mailing List, James Molloy

----- Original Message -----
> From: "Andrew Trick" <atr...@apple.com>
> To: "Arnold Schwaighofer" <aschwa...@apple.com>
> Cc: "Chandler Carruth" <chan...@gmail.com>, "LLVM Developers Mailing List" <llv...@cs.uiuc.edu>, "James Molloy"
> <ja...@jamesmolloy.co.uk>, "Zinovy Nis" <zinov...@gmail.com>, "Hal Finkel" <hfi...@anl.gov>, "Gerolf Hoflehner"
> <ghofl...@apple.com>
> Sent: Tuesday, October 14, 2014 12:11:43 PM
> Subject: Re: RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
>
>

Why? I understand that it is not cheap (especially because it calls into ValueTracking a lot), but how expensive is it when it has nothing to do?

> but understand it
> may help cleanup the vectorized preheader.
>
> >> + licm
> >> + loop-unswitch
>
> These should limited to the relevant loop nest.
>
> >> + simplifycfg
>
> OK if the CFG actually changed.
>
> >> + instcombine
>
> instcombine again! This can’t be good.
>
> >> slp-vectorize
> >> + early-cse
>
> SLP should do its own CSE.

I'm not sure how much of this is reasonable. Obviously, it can do its own CSE within each vectorization tree. But across trees (where multiple independent parts of the function are vectorized), finding and reusing gather sequences, etc. is a general CSE problem, and I'm not sure how much of that we want to replicate in the SLP vectorizer.

When I switched my internal builds from using the BBVectorizer by default to using the SLP vectorizer by default, I saw a number of performance regressions (mostly not from the vectorization, but from the lack of the 'cleanup' passes, EarlyCSE and InstCombine, that were generally being run afterward). My general impression is that running these passes late in the pipeline brings general benefits.

>
> —
>
> I think it’s generally useful to have an “extreme” level of
> optimization without much regard for compile time, and in that
> scenario this pipeline makes sense. But this is hardly something
> that should happen at -O2/-Os, unless real data shows otherwise.

Doing all this only at >= -O3 does not seem unreasonable to me.

>
> If the pass manager were designed to conditionally invoke late passes
> triggered by certain transformation passes, that would solve my
> immediate concern.
>
> Long term, I think a much better design is for function
> transformations to be conditionally rerun within a scope/region. For
> example, loop-vectorize should be able to trigger instcombine on the
> loop preheader, which I think is the real problem here.

As Chandler might recall ;) -- I've made several requests that the new pass manager design specifically support this.

-Hal

>
> -Andy
>
> >> The rationale I have for this:
> >>
> >> 1) Zinovy pointed out that the loop vectorizer really needs the
> >> input loops to still be rotated. One counter point is that
> >> perhaps we should prevent any pass from un-rotating loops?
> >>
> >> 2) I cherrypicked the core of the scalar optimization pipeline
> >> that seems like it would be relevant to code which looks like
> >> runtime checks. Things like correlated values for overlap
> >> predicates, loop invariant code, or predicates that can be
> >> unswitched out of loops. Then I added the canonicalizing passes
> >> that might be relevant given those passes.
> >>
> >> 3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it
> >> isn't relevant for SLP vectorize, no idea. I did say this was a
> >> straw man. =D
> >>
> >>
> >> My benchmarking has shown some modest improvements to benchmarks,
> >> but nothing huge. However, it shows only a 2% slowdown for
> >> building the 'opt' binary, which I'm actually happy with so that
> >> we can work to improve the loop vectorizer's overhead *knowing*
> >> that these passes will clean up stuff. Thoughts? I'm currently OK
> >> with this, but it's pretty borderline so I just wanted to start
> >> the discussion and see what other folks observe in their
> >> benchmarking.
> >>
> >> -Chandler
> >
>
>

--

Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________

Andrew Trick

unread,

Oct 14, 2014, 1:51:51 PM10/14/14

to Hal Finkel, LLVM Developers Mailing List, James Molloy

On Oct 14, 2014, at 10:28 AM, Hal Finkel <hfi...@anl.gov> wrote:

+ instcombine

I'm *very* concerned about rerunning instcombine,

Why? I understand that it is not cheap (especially because it calls into ValueTracking a lot), but how expensive is it when it has nothing to do?

Ok, I’ll reserve judgement until I have that data point.

-Andy

Chandler Carruth

unread,

Oct 14, 2014, 1:56:53 PM10/14/14

to Hal Finkel, James Molloy, LLVM Developers Mailing List

On Tue, Oct 14, 2014 at 10:28 AM, Hal Finkel <hfi...@anl.gov> wrote:

> I think it’s generally useful to have an “extreme” level of
> optimization without much regard for compile time, and in that
> scenario this pipeline makes sense. But this is hardly something
> that should happen at -O2/-Os, unless real data shows otherwise.

Doing all this only at >= -O3 does not seem unreasonable to me.

FWIW, I think we're being overly conservative if we're relegating these to -O3 when the total cost is 2%. That doesn't seem like the right tradeoff.

I actually agree that the set I proposed is on the aggressive end -- that was the point -- but we have more than 2% fluctuations in the optimizers' runtime from month to month. If we want to rip stuff out it should be because of a principled reason that it isn't going to help the code in that phase.

Hal Finkel

unread,

Oct 14, 2014, 1:58:02 PM10/14/14

to Chandler Carruth, James Molloy, LLVM Developers Mailing List

----- Original Message -----
> From: "Chandler Carruth" <chan...@google.com>
> To: "Hal Finkel" <hfi...@anl.gov>
> Cc: "Andrew Trick" <atr...@apple.com>, "LLVM Developers Mailing List" <llv...@cs.uiuc.edu>, "James Molloy"
> <ja...@jamesmolloy.co.uk>

> Sent: Tuesday, October 14, 2014 12:41:38 PM
> Subject: Re: [LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
>
>
>
>
>
> On Tue, Oct 14, 2014 at 10:28 AM, Hal Finkel < hfi...@anl.gov >
> wrote:
>
>
> > I think it’s generally useful to have an “extreme” level of
> > optimization without much regard for compile time, and in that
> > scenario this pipeline makes sense. But this is hardly something
> > that should happen at -O2/-Os, unless real data shows otherwise.
>
> Doing all this only at >= -O3 does not seem unreasonable to me.
> FWIW, I think we're being overly conservative if we're relegating
> these to -O3 when the total cost is 2%. That doesn't seem like the
> right tradeoff.

While I think it is not unreasonable not to do it, I also think it would be reasonable to do it. Late cleanup is currently a place where we could use some improvement.

>
>
> I actually agree that the set I proposed is on the aggressive end --

I did say 'all' ;)

> that was the point -- but we have more than 2% fluctuations in the
> optimizers' runtime from month to month. If we want to rip stuff out
> it should be because of a principled reason that it isn't going to
> help the code in that phase.

I completely agree.

-Hal

Andrew Trick

unread,

Oct 14, 2014, 2:01:22 PM10/14/14

to Chandler Carruth, James Molloy, LLVM Developers Mailing List

The “total cost” is not the same as the overhead when building the opt binary. But the flag is there now, so it’s just a matter of trying it out.

How does the 2% overhead from cleanup compare to the cost of the vectorizer itself?

-Andy

Chandler Carruth

unread,

Oct 14, 2014, 2:04:57 PM10/14/14

to Andrew Trick, James Molloy, LLVM Developers Mailing List

On Tue, Oct 14, 2014 at 10:11 AM, Andrew Trick <atr...@apple.com> wrote:

>> + correlated-propagation

A little worried about this.

>> + instcombine

I'm *very* concerned about rerunning instcombine, but understand it may help cleanup the vectorized preheader.

Why are you concerned? Is instcombine that slow? I usually don't see huge overhead from re-running it on nearly-canonical code. (Oh, I see you just replied to Hal here, fair enough.

>> + licm
>> + loop-unswitch

These should limited to the relevant loop nest.

We have no way to do that currently. Do you think they will in practice be too slow? If so, why? I would naively expect unswitch to be essentially free unless it can do something, and LICM not much more expensive.

>> + simplifycfg

OK if the CFG actually changed.

Again, we have no mechanism to gate this. Frustratingly, the only thing I want here is to delete dead code formed by earlier passes. We just don't have anything cheaper (and I don't have any measurements indicating we need something cheaper).

>> + instcombine

instcombine again! This can’t be good.

I actually have no specific reason to think we need this other than the fact that we run instcombine after simplifycfg in a bunch of other places. If you're looking for one to rip out, this would be the first one I would rip out because I'm doubtful of its value.

On a separate note:

>> + early-cse

Passes like loop-vectorize should be able to do their own CSE without much engineering effort.

>> slp-vectorize
>> + early-cse

SLP should do its own CSE.

I actually agree with you in principle, but I would rather run the pass now (and avoid hacks downstream to essentially do CSE in the backend) than hold up progress on the hope of advanced on-demand CSE layers being added to the vectorizers. I don't know of anyone actually working on that, and so I'm somewhat concerned it will never materialize.

Hal Finkel

unread,

Oct 14, 2014, 2:20:22 PM10/14/14

to Chandler Carruth, James Molloy, LLVM Developers Mailing List

----- Original Message -----
> From: "Chandler Carruth" <chan...@google.com>
> To: "Andrew Trick" <atr...@apple.com>
> Cc: "James Molloy" <ja...@jamesmolloy.co.uk>, "LLVM Developers Mailing List" <llv...@cs.uiuc.edu>
> Sent: Tuesday, October 14, 2014 12:56:46 PM
> Subject: Re: [LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
>
>
>
>
>
>

I mentioned this in another mail, but to be specific, I'm also inclined to think that, globally, it shouldn't materialize. SLP should do its own internal cleanup, per tree, but cross-tree CSE should likely be left to an actual CSE pass (or perhaps GVN at -O3, but that's another matter).

What I mean is that if we have:

entry:
br %cond, %block1, %block2
block1:
stuff in here is SLP vectorized
...
block2:
stuff in here is SLP vectorized
...

or even just:

entry:
...
stuff in here is SLP vectorized
...
stuff here is also SLP vectorized (using some of the same inputs)
...

there might be some common vector shuffles, insert/extractelement instructions, etc. that are generated in both blocks that CSE might combine. But this is a general CSE problem (especially as these things might be memory operations, and thus need to deal with memory dependency issues), and we should not have new generalized CSE logic in the vectorizers (although we could certainly think about factoring some of the current logic out into utility functions).

-Hal

Andrew Trick

unread,

Oct 14, 2014, 2:24:18 PM10/14/14

to Chandler Carruth, James Molloy, LLVM Developers Mailing List

I’ll summarize your responses as: The new pipeline produces better results than the old, and we currently have no good mechanism for reducing the compile time overhead.

I’ll summarize my criticism as: In principle, there are better ways to clean up after the vectorizer without turning it into a complicated megapass, but no one has done the engineering. I don’t think cleaning up after the vectorizer should incur any noticeable overhead if the vectorizer never runs, and it would be avoidable with a sensibly designed passes that aren’t locked into the current pass manager design.

I don’t have the data right now to argue against enabling the new pipeline under O2. Hopefully others who care about clang compile time will jump in.

As for the long-term plan to improve compile-time, all I can do now is to advocate for a better approach.

-Andy

Nadav Rotem

unread,

Oct 14, 2014, 2:33:52 PM10/14/14

to Andrew Trick, James Molloy, LLVM Developers Mailing List

On Oct 14, 2014, at 11:21 AM, Andrew Trick <atr...@apple.com> wrote:

I actually agree with you in principle, but I would rather run the pass now (and avoid hacks downstream to essentially do CSE in the backend) than hold up progress on the hope of advanced on-demand CSE layers being added to the vectorizers. I don't know of anyone actually working on that, and so I'm somewhat concerned it will never materialize.

The SLP vectorizer already has a built-in CSE. The Loop vectorizer does not need a CSE AFAIK, but it does need InstCombine to cleanup the code that we generate for induction variables and scatter/gather.

Arnold Schwaighofer

unread,

Oct 14, 2014, 2:48:37 PM10/14/14

to Nadav Rotem, James Molloy, LLVM Developers Mailing List

The loop vectorizer also performs CSE internally. There is an interaction between CSE and instcombine (induction variables if I remember correctly) where you want to run CSE followed by instcombine - (maybe even followed by CSE again - not totally positive but I think this was the case) that we would not cleanup otherwise.

Hal Finkel

unread,

Oct 14, 2014, 3:27:12 PM10/14/14

to Andrew Trick, James Molloy, LLVM Developers Mailing List

----- Original Message -----
> From: "Andrew Trick" <atr...@apple.com>
> To: "Chandler Carruth" <chan...@google.com>
> Cc: "James Molloy" <ja...@jamesmolloy.co.uk>, "LLVM Developers Mailing List" <llv...@cs.uiuc.edu>
> Sent: Tuesday, October 14, 2014 1:21:21 PM
> Subject: Re: [LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
>
>

> I’ll summarize your responses as: The new pipeline produces better
> results than the old, and we currently have no good mechanism for
> reducing the compile time overhead.
>
>
> I’ll summarize my criticism as: In principle, there are better ways
> to clean up after the vectorizer without turning it into a
> complicated megapass, but no one has done the engineering. I don’t
> think cleaning up after the vectorizer should incur any noticeable
> overhead if the vectorizer never runs, and it would be avoidable
> with a sensibly designed passes that aren’t locked into the current
> pass manager design.
>
> I don’t have the data right now to argue against enabling the new
> pipeline under O2. Hopefully others who care about clang compile
> time will jump in.
>
> As for the long-term plan to improve compile-time, all I can do now
> is to advocate for a better approach.

Sure, but we should also have a plan ;) -- I think this is important, and we should make sure this doesn't get dropped. I don't think that the vectorizers are the only thing for which the concept of 'cleanup' passes are relevant. The critical piece here is that, if we don't need to run a cleanup pass, we also don't need to compute any analysis passes on which the pass might depend.

While I'm in favor of the new passes, I also want us to have a plan in this area. I understand that the pass manager is being rewritten, and maybe putting effort into the old one is not worthwhile now, but the new one should certainly have a design compatible with this requirement. Chandler?

-Hal

Robin Morisset

unread,

Oct 14, 2014, 4:59:02 PM10/14/14

to Hal Finkel, James Molloy, LLVM Developers Mailing List

For what it is worth, I agree with the usefulness of having a concept of "cleanup pass". Another example of a situation where it would be nice is in the fence elimination patch I sent for review recently: the pass is rather expensive because it relies on several analysis passes, and is only useful if AtomicExpand introduced fences. Being able to say "Only run this pass if the code was modified by this previous pass" would be awesome.

Yet another example is the EnableAtomicTidy flag on ARM that runs simplifyCFG after AtomicExpand to cleanup some of the load-linked/store-conditional loops: if no such loops have been introduced, the cleanup is unnecessary.

Chandler Carruth

unread,

Oct 14, 2014, 5:15:55 PM10/14/14

to Robin Morisset, James Molloy, LLVM Developers Mailing List

Yes, this is a much needed feature. I have and will continue to push back on trying to add it until we at least get a sane framework for our *existing* pass pipelines though. I'm just a bit worried about scope creep here is all. It's definitely something I want to see happen long-term.

Renato Golin

unread,

Oct 14, 2014, 5:26:04 PM10/14/14

to Chandler Carruth, James Molloy, LLVM Developers Mailing List

On 14 October 2014 22:09, Chandler Carruth <chan...@google.com> wrote:
> Yes, this is a much needed feature. I have and will continue to push back on
> trying to add it until we at least get a sane framework for our *existing*
> pass pipelines though. I'm just a bit worried about scope creep here is all.
> It's definitely something I want to see happen long-term.

Agreed. But we have to be careful to not transform the pass manager
into a makefile in order to avoid multiple passes of the same cleanup
pass. :)

cheers,
--renato

Andrew Trick

unread,

Oct 14, 2014, 6:09:50 PM10/14/14

to Hal Finkel, James Molloy, LLVM Developers Mailing List

Thanks for taking the time to explain! I wasn’t aware of the need to CSE across trees. In general, it would be great to see more comments either in the pass setup or the pass header explaining the dependencies between passes and motivation for those dependencies. Otherwise it gets hard to distinguish between coincidental vs. intentional pass order as things evolve.

-Andy

Hal Finkel

unread,

Oct 14, 2014, 7:07:06 PM10/14/14

to Chandler Carruth, James Molloy, LLVM Developers Mailing List

----- Original Message -----
> From: "Chandler Carruth" <chan...@google.com>
> To: "Robin Morisset" <mori...@google.com>
> Cc: "Hal Finkel" <hfi...@anl.gov>, "James Molloy" <ja...@jamesmolloy.co.uk>, "LLVM Developers Mailing List"
> <llv...@cs.uiuc.edu>
> Sent: Tuesday, October 14, 2014 4:09:11 PM
> Subject: Re: [LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
>
>

> Yes, this is a much needed feature. [snip] I'm just a bit worried about

> scope creep here is all.

I don't think this is a prerequisite for the particular pass-manager change you're proposing, but I'd like to make sure we're on the same page regarding what should happen in the future.

> I have and will continue to push
> back on trying to add it until we at least get a sane framework for
> our *existing* pass pipelines though.

And this makes me even more convinced that we might not all be on the same page ;) -- I really would like you to clarify exactly what this means, and what it has to do with registration of cleanup passes; or to put it another way, why cleanup-pass registration is not part of bringing further sanity to the configuration.

> It's definitely something I want to see
> happen long-term.

Good.

Thanks again,
Hal

Chandler Carruth

unread,

Oct 14, 2014, 7:08:28 PM10/14/14

to Hal Finkel, James Molloy, LLVM Developers Mailing List

On Tue, Oct 14, 2014 at 3:50 PM, Hal Finkel <hfi...@anl.gov> wrote:

> I have and will continue to push
> back on trying to add it until we at least get a sane framework for
> our *existing* pass pipelines though.

And this makes me even more convinced that we might not all be on the same page ;) -- I really would like you to clarify exactly what this means, and what it has to do with registration of cleanup passes; or to put it another way, why cleanup-pass registration is not part of bringing further sanity to the configuration.

I'm starting off assuming a fixed (IE, known up-front) transformation pass pipeline. That's all.

Zinovy Nis

unread,

Oct 16, 2014, 10:59:49 AM10/16/14

to Chandler Carruth, James Molloy, LLVM Developers Mailing List

Seems that adding -extra-vectorizer-passes doesn't help to vectorizer
in my case. LoopRotation re-run does nothing.

Chandler Carruth

unread,

Oct 16, 2014, 2:25:43 PM10/16/14

to Zinovy Nis, James Molloy, LLVM Developers Mailing List

On Thu, Oct 16, 2014 at 7:55 AM, Zinovy Nis <zinov...@gmail.com> wrote:

Seems that adding -extra-vectorizer-passes doesn't help to vectorizer
in my case. LoopRotation re-run does nothing.

Can you reduce this to a test case you can share?

Hal Finkel

unread,

Oct 16, 2014, 2:40:37 PM10/16/14

to Chandler Carruth, James Molloy, LLVM Developers Mailing List

----- Original Message -----
> From: "Chandler Carruth" <chan...@google.com>
> To: "Zinovy Nis" <zinov...@gmail.com>
> Cc: "Hal Finkel" <hfi...@anl.gov>, "James Molloy" <ja...@jamesmolloy.co.uk>, "LLVM Developers Mailing List"
> <llv...@cs.uiuc.edu>
> Sent: Thursday, October 16, 2014 1:21:19 PM
> Subject: Re: [LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
>
>
>
>
>

He had posted one earlier here:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20141006/238660.html
(Arnold had posted some analysis here: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20141006/239144.html - which I imagine you saw)

-Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Chandler Carruth

unread,

Oct 16, 2014, 2:56:08 PM10/16/14

to Hal Finkel, James Molloy, LLVM Developers Mailing List

On Thu, Oct 16, 2014 at 11:38 AM, Hal Finkel <hfi...@anl.gov> wrote:

He had posted one earlier here:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20141006/238660.html
(Arnold had posted some analysis here: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20141006/239144.html - which I imagine you saw)

Doh, sorry. I saw Arnold's analysis but (wrongly) assumed that a complete working test case wasn't available which is why Arnold expected loop-rotate to fix this when it didn't. My bad.

Arnold Schwaighofer

unread,

Oct 16, 2014, 6:11:49 PM10/16/14

to Chandler Carruth, James Molloy, LLVM Developers Mailing List

I quickly took a look again. Instcombine is removing the fast-math flag on the reduction so the vectorizer does not touch it.

bin/clang -O3 -ffast-math -mllvm -extra-vectorizer-passes aobench.cpp -emit-llvm -S -o aobench.2.ll -mllvm -debug-only=loop-vectorize -mllvm -print-before-all -mllvm -print-after-all 2> aobench.2.debug.ll

*** IR Dump Before Combine redundant instructions ***
; Function Attrs: noinline nounwind ssp uwtable
define void @_Z21ambient_occlusion_vecP6_IsectR5vrandILm8EE(%struct._Isect* nocapture %isect, %class.vrand* nocapture readonly dereferenceable(32) %rng) #0 {
entry:
...
br label %for.body

for.body: ; preds = %for.inc.for.body_crit_edge, %entry
%8 = phi float [ %conv.i, %entry ], [ %.pre, %for.inc.for.body_crit_edge ]
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc.for.body_crit_edge ]
%occlusion.017 = phi float [ 0.000000e+00, %entry ], [ %conv4, %for.inc.for.body_crit_edge ]
%arrayidx = getelementptr inbounds [8 x float]* %rand1, i64 0, i64 %indvars.iv
%9 = load float* %arrayidx, align 4, !tbaa !5
%mul3 = fmul fast float %8, %9
%mul.i.i = fmul fast float %mul3, %mul3
%cmp.i = fcmp olt float %mul.i.i, 0x3C670EF540000000
%conv4 = fadd fast float %occlusion.017, 1.000000e+00 <============= FAST
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv, 63
br i1 %exitcond, label %for.end, label %for.inc.for.body_crit_edge
for.inc.for.body_crit_edge: ; preds = %for.body
%arrayidx2.phi.trans.insert = getelementptr inbounds [8 x float]* %rand2, i64 0, i64 %indvars.iv.next
%.pre = load float* %arrayidx2.phi.trans.insert, align 4, !tbaa !5
br label %for.body

for.end: ; preds = %for.body
%t5 = getelementptr inbounds %struct._Isect* %isect, i64 0, i32 0
store float %conv4, float* %t5, align 4, !tbaa !7
ret void
}

*** IR Dump After Combine redundant instructions ***
; Function Attrs: noinline nounwind ssp uwtable
define void @_Z21ambient_occlusion_vecP6_IsectR5vrandILm8EE(%struct._Isect* nocapture %isect, %class.vrand* nocapture readonly dereferenceable(32) %rng) #0 {
entry:
br label %for.body

for.body: ; preds = %for.inc.for.body_crit_edge, %entry
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc.for.body_crit_edge ]
%occlusion.017 = phi float [ 1.000000e+00, %entry ], [ %phitmp, %for.inc.for.body_crit_edge ]
%exitcond = icmp eq i64 %indvars.iv, 63
br i1 %exitcond, label %for.end, label %for.inc.for.body_crit_edge

for.inc.for.body_crit_edge: ; preds = %for.body
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%phitmp = fadd float %occlusion.017, 1.000000e+00 <=============== NOT FAST
br label %for.body

for.end: ; preds = %for.body
%t5 = getelementptr inbounds %struct._Isect* %isect, i64 0, i32 0
store float %occlusion.017, float* %t5, align 4, !tbaa !1
ret void

Reply all

Reply to author

Forward