[llvm-dev] CFG normalization: avoiding `br i1 false`


Ariel Ben-Yehuda via llvm-dev

unread,
Nov 9, 2017, 6:38:42 PM11/9/17
to llvm...@lists.llvm.org
Hi,

I was looking at Rust programs which are poorly optimized by LLVM, and
one occasional factor is that LLVM allows for long-lived `br i1 false`
instructions.

The problem is that when a propagation pass discovers the value of a
branch condition, the branch itself will not be eliminated until a
SimplifyCFG pass is reached, of which there are few in the pipeline.
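As a minimal sketch (with made-up names) of what such leftover IR looks
like: a propagation pass may have already proven the condition, yet the
terminator is still conditional:

```llvm
define i32 @example(i32 %x) {
entry:
  ; a propagation pass has folded the condition to `false`, but the
  ; conditional branch survives until the next SimplifyCFG run
  br i1 false, label %dead, label %live

dead:                                   ; statically unreachable
  ret i32 0

live:
  ret i32 %x
}
```

SimplifyCFG would fold this to `br label %live` and delete %dead, but
until it runs, every pass in between sees the conditional branch.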

One example is https://github.com/rust-lang/rust/issues/44041 - the
branch conditions are proven constant by loop unrolling, but the
`br i1 false` instructions leak through to codegen.

Another example is https://github.com/rust-lang/rust/issues/45466.

In that case, induction variable simplification discovers that an
integer overflow check is unnecessary, but the remaining `br i1 false`
still blocks the subsequent LoopIdiomRecognize pass and prevents the
loop from being optimized to a memset.
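A rough sketch of the shape involved (hypothetical names and types, not
taken from the issue) might be:

```llvm
; inner loop after IndVarSimplify has proven the overflow check redundant
loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %latch ]
  %p = getelementptr i8, i8* %base, i64 %i
  store i8 0, i8* %p                        ; would match a memset idiom
  %i.next = add nuw i64 %i, 1
  ; the check is known to never fire, yet the branch is still there
  br i1 false, label %trap, label %latch

latch:
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %loop
```

The extra conditional terminator and the edge to %trap are enough to
keep LoopIdiomRecognize from treating this as a plain memset loop until
some later SimplifyCFG run cleans it up.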

While it is possible to have point fixes for both problems, this might
be a widespread enough problem that a more general solution would be
better - for example, some sort of canonicalization that automatically
removes these branches in certain situations (this requires things
such as the dominator tree to be rebuilt, so it shouldn't be done
literally everywhere, but it could be done fairly often).

What are your opinions?

Thanks,
- Ariel
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Davide Italiano via llvm-dev

unread,
Nov 9, 2017, 6:54:49 PM11/9/17
to Ariel Ben-Yehuda, llvm-dev
On Thu, Nov 9, 2017 at 3:38 PM, Ariel Ben-Yehuda via llvm-dev
<llvm...@lists.llvm.org> wrote:
> Hi,
>
> I was looking at Rust programs which are poorly optimized by LLVM, and
> one occasional factor is that LLVM allows for long-lived `br i1 false`
> instructions.
>
> The problem is that when a propagation pass discovers the value of a
> branch condition, the branch itself will not be eliminated until a
> SimplifyCFG pass is reached, of which there are few in the pipeline.
>
> One example is https://github.com/rust-lang/rust/issues/44041 - the
> branch conditions are found by loop unrolling, but the `br i1 false`
> instruction leaks to codegen.
>
> Another example is https://github.com/rust-lang/rust/issues/45466.
>
> In that case, induction variable simplification discovers that an
> integer overflow check is unnecessary, but the remaining `br i1 false`
> still blocks the following LoopIdiomRecognize pass and prevents the
> loop from being optimized to a memset.
>

SimplifyCFG is the correct pass for this kind of transformation. I
don't think it's entirely unreasonable to run it more often, but the
compile-time cost needs to be taken into account.

I'm not necessarily sympathetic to the idea of adding another canonicalization
pass only for this purpose.

Side note:
IMHO, at this point SimplifyCFG is doing even more than it should
(including, e.g., sinking of variables from "almost empty" BBs, i.e.
BBs which have only a single instruction plus the terminator, if I
recall correctly). If anything, I'd rather split the sinking logic and
other operations into a separate pass than move the folding of proven
unconditional branches elsewhere :)


Also, FWIW, there has been a recent effort to split SimplifyCFG into
multiple variants (with several cl::opt flags to enable specific
transformations).


> While it is possible to have point fixes for both problems, this might
> be a widespread enough problem that a more general solution would be
> better - for example, some sort of canonicalization that automatically
> removes these branches in certain situations (this requires things
> such as the dominator tree to be rebuilt, so it shouldn't be done
> literally everywhere, but it could be done fairly often).
>

Now that LLVM has an incremental dominator tree API, it should be more
feasible to run SimplifyCFG more times without recomputing the
dominator tree every time. I haven't checked whether there are other
expensive analyses that would need to be preserved.

Thanks,

--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare

Ariel Ben-Yehuda via llvm-dev

unread,
Nov 15, 2017, 8:15:17 AM11/15/17
to Davide Italiano, llvm-dev
> I'm not necessarily sympathetic to the idea of adding another canonicalization pass only for this purpose.


The problem is that, as you said, SimplifyCFG does all sorts of stuff,
and I suspect it is not the fastest pass in the world.

Also, in the case that annoys me, there is an LCSSA pass in the
middle, and I suspect it would be a better idea to only redo the LCSSA
etc. transform when changes were actually made - no changes being, I
suspect, the most common case.

Davide Italiano via llvm-dev

unread,
Nov 15, 2017, 1:03:57 PM11/15/17
to Ariel Ben-Yehuda, llvm-dev
On Wed, Nov 15, 2017 at 5:15 AM, Ariel Ben-Yehuda <arie...@gmail.com> wrote:
>> I'm not necessarily sympathetic to the idea of adding another canonicalization pass only for this purpose.
>
>
> The problem is that, as you said, SimplifyCFG does all sorts of stuff,
> and I suspect it is not the fastest pass in the world.
>

I've not seen it show up in a profile myself, although I'm not sure
whether that will remain true in the future.
Sinking, e.g., could be moved into a separate pass (in fact, there's
one already, disabled by default).

> Also, in the case that annoys me, there is an LCSSA pass in the
> middle, and I suspect it would be a better idea to only do the LCSSA
> etc. transform again if no changes were made, which I suspect is the
> most common case.
>

LCSSA should be linear. Currently it's not, because it uses the SSA
updater, which computes dominance frontiers, so it can get to O(N^3).
The alternative SSA updater proposed in https://reviews.llvm.org/D28934
solves this problem using the novel algorithm from Braun et al. linked
in the review.
It still has the problem that it leaves redundant phis around at the
beginning of the block, e.g.

%patatino1 = phi i32 [ %tinky, %somebb ], [ %winky, %someotherbb ]
%patatino2 = phi i32 [ %tinky, %somebb ], [ %winky, %someotherbb ]

The current SSA updater pays a (sometimes high) cost to perform this
deduplication/removal (in practice, it checks whether an equivalent
PHI already exists, and only inserts a new one if not).

--
Davide

Philip Reames via llvm-dev

unread,
Nov 29, 2017, 12:48:30 PM11/29/17
to Davide Italiano, Ariel Ben-Yehuda, llvm-dev
There's already a LoopSimplifyCFG, which is a loop pass (and thus can
iterate with other loop passes) and eliminates trivial branches. Having
a simple pass which just strips unreachable blocks and converts
conditional branches to unconditional ones while updating the
appropriate analyses (LoopInfo, DomTree, LCSSA, etc.) seems very
reasonable. This could also be a utility function called from the end
of other passes. The hard bit is the analysis preservation. A good
place to copy from might be the recent loop-unswitch work Chandler did.

Philip

Davide Italiano via llvm-dev

unread,
Nov 29, 2017, 1:09:07 PM11/29/17
to Philip Reames, llvm-dev
On Wed, Nov 29, 2017 at 9:48 AM, Philip Reames
<list...@philipreames.com> wrote:
> There's already a LoopSimplifyCFG which is a loop-pass (and thus can iterate
> with other loop passes) and eliminates trivial branches. Having a simple
> pass which just strips unreachable blocks and converts conditional branches
> to unconditional ones while updating appropriate analyses (LoopInfo,
> DomTree, LCSSA, etc..) seems very reasonable.

I'm not necessarily convinced that a one-trick-pony pass is the best
option here.
It adds complexity, and there's already a pass in-tree which can
provide this functionality.
What are your concerns about running SimplifyCFG more often? It
shouldn't be a compile-time sink.

> This could also be a utility
> function called from the end of other passes. The hard bit is the analysis
> preservation. A good place to copy from might be the recent loop-unswitch
> work Chandler did.
>
> Philip
>

I don't think preserving the analyses is extremely hard (but I may be
wrong on this).
The incremental dominator tree API makes the updates fairly easy.
LCSSA is slightly more complicated.
In fact, if you take a look at the new LoopUnswitch, it calls
recalculate rather than fixing LCSSA incrementally.
But if LCSSA is computed right, recomputing it entirely should take
very little time.

Thanks,

via llvm-dev

unread,
Dec 4, 2017, 6:44:21 PM12/4/17
to Philip Reames, llvm-dev
LoopSimplifyCFG should be updated to do this (I don't know if it's in the main pass pipeline yet, though).

When I wrote it, I basically included only the simplest behavior I needed, with the possibility of adding more in the future; I don't know how much has been added since, of course.

—escha

Anna Thomas via llvm-dev

unread,
Dec 4, 2017, 7:06:26 PM12/4/17
to Davide Italiano, llvm-dev
Hi Davide,

> On Nov 29, 2017, at 1:08 PM, Davide Italiano via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> On Wed, Nov 29, 2017 at 9:48 AM, Philip Reames
> <list...@philipreames.com> wrote:
>> There's already a LoopSimplifyCFG which is a loop-pass (and thus can iterate
>> with other loop passes) and eliminates trivial branches. Having a simple
>> pass which just strips unreachable blocks and converts conditional branches
>> to unconditional ones while updating appropriate analyses (LoopInfo,
>> DomTree, LCSSA, etc..) seems very reasonable.
>
> I'm not necessarily convinced having one-trick-pony pass is the best
> option here.
> It adds complexity, and there's already a pass in-tree which can
> provide this functionality.
> What are your concerns about running SimplifyCFG more often? It
> shouldn't be a compile-time sink
The main issue with running SimplifyCFG more often is if we were to place the pass
*in between* a bunch of loop passes. At least in the legacy pass manager, I think this breaks the nice
loop caching we get from placing all the loop passes together.
I haven't done any measurements on compile time to verify this theory.

LoopSimplifyCFG being a loop pass will not have this problem.

>
>> This could also be a utility
>> function called from the end of other passes. The hard bit is the analysis
>> preservation. A good place to copy from might be the recent loop-unswitch
>> work Chandler did.
>>
>> Philip
>>
>
> I don't think preserving the analyses is extremely hard (but I may be
> wrong on this).
> The incremental Dominator tree API makes the updates fairly easy.
> LCSSA is slightly more complicated.
> If you take a look at the new LoopUnswitch, in fact, it does call
> recalculate rather than fixing it incrementally.
> But, if LCSSA is computed right, recomputing entirely should take very
> little time.

The trouble (when I tried this transform in LoopSimplifyCFG) is updating the
LoopInfo analysis: we can have cases where eliminating trivial branches in an inner loop
breaks the outer loop structure and causes it to no longer be a loop (consider a perfectly nested
inner loop within an outer loop, where the trivial branch is `br i1 false, label %outerloopheader, label %innerloopHdr`).
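A minimal IR sketch of that situation (labels as in the description
above, everything else made up) would be:

```llvm
outerloopheader:
  br label %innerloopHdr

innerloopHdr:
  ; the condition was proven false, so folding this yields
  ; `br label %innerloopHdr` - the edge back to %outerloopheader
  ; disappears, the outer loop loses its backedge, and LoopInfo
  ; must stop treating the outer construct as a loop
  br i1 false, label %outerloopheader, label %innerloopHdr
```

So eliminating the trivial branch is easy; keeping LoopInfo consistent
afterwards is the part that needs care.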

Another problem is that changing the branches to unconditional ones can
make some inner loop unreachable (i.e. no longer a loop in LLVM terminology).

We would need to record these kinds of loops and update the LPM accordingly. There may be more cases here.

Thanks,
Anna

Anna Thomas via llvm-dev

unread,
Feb 13, 2018, 8:55:20 AM2/13/18
to llvm-dev
[ + cc llvm-dev]. 

This came up again internally and I realized I hadn’t cc’ed llvm-dev in my original response.


Anna

Michael Zolotukhin via llvm-dev

unread,
Feb 18, 2018, 7:51:47 PM2/18/18
to Anna Thomas, llvm-dev
I agree that LoopSimplifyCFG is the right place to do this. As Anna pointed out, it's not as easy as just running SimplifyCFG on the blocks belonging to the current loop: updating the loop structure and preserving LCSSA make it challenging. In the end, though, it should not only solve the original issue mentioned in this thread (eliminating `br i1 false`), but it can also help compile time by reducing IR size.

Michael