Can you please elaborate on the following statement?
> "I've replicated its functionality with a much updated transformation algorithm (no longer O(N^2), and finds better orderings!)"
Is this a new algorithm, or something existing you can refer to? Thanks.
Sergei
Hello,
Thoughts?
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> Hello,
>
> I'm working on basic-block placement optimizations based on branch probability information. I've run into a stumbling block though. One of the existing passes to do this, an essentially dead pass called 'block-placement', operates on the IR, reordering the blocks of the function and relying on the code generator to preserve that order unless it has a reason to rearrange it. This pass is entirely untested AFAICT, and it doesn't work any more...
>
> That said, I've replicated its functionality with a much updated transformation algorithm (no longer O(N^2), and finds better orderings!) and predicated it on the new frequency and probability information. It "works" great, but it doesn't actually optimize anything. It re-arranges all the blocks of the IR function, laying them out perfectly. And then the code generation layer throws that away and seems to establish its own ordering altogether.
>
> What's the best approach here? Is ignoring the order of blocks in the IR function desired behavior? (It does seem reasonable, it just seems to have not always been that way, so I wondered.)
>
> Should I sink my pass down to a codegen pass?
I think this should really live as a CodeGen pass. Is there any good reason to make it an IR pass?
Cameron
> As for why it should be an IR pass: mostly because once the selection DAG runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires knowing about the terminators (which we only sometimes do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.
We should be able to analyze more terminators in the backend. Most other compilers have this ability. If we don't fix this problem now, we will only increase the distance between the functionality of the backend IR and the mid-level IR.
> Operating at the IR level makes writing the pass a breeze, and we can quickly and efficiently simply lay the blocks out in the ideal ordering. Then the SelectionDAG and other layers can preserve this modulo folding and other optimization opportunities.
The biggest problem with doing block layout at the IR level is that you don't have the final CFG. Lots of passes can modify the CFG, and they will have to rely on the sorts of local heuristics that already exist (and are known to perform poorly in some cases) to redo block layout after these changes. You also can't represent jump fall-throughs in the mid-level IR (nor should you be able to).
On Tue, Oct 18, 2011 at 2:59 PM, Cameron Zwarich <zwa...@apple.com> wrote:
> I think this should really live as a CodeGen pass. Is there any good reason to make it an IR pass?

So, as it happens, I was *completely* wrong here. CodeGen correctly preserves the ordering of blocks from IR, *unless* it can do folding, etc.
As for why it should be an IR pass: mostly because once the selection DAG runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires knowing about the terminators (which we only sometimes do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.
On Oct 18, 2011, at 3:07 PM, Chandler Carruth wrote:
> On Tue, Oct 18, 2011 at 2:59 PM, Cameron Zwarich <zwa...@apple.com> wrote:
>> I think this should really live as a CodeGen pass. Is there any good reason to make it an IR pass?
>
> So, as it happens, I was *completely* wrong here. CodeGen correctly preserves the ordering of blocks from IR, *unless* it can do folding, etc.

That's right. However, the CFG changes quite a bit during CodeGen. Switches can be lowered into branch trees, multiple passes can split critical edges, and then there is taildup and tailmerge. An IR code layout algorithm simply doesn't know the final CFG.
> As for why it should be an IR pass: mostly because once the selection DAG runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires knowing about the terminators (which we only sometimes do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.

Those are all the wrong reasons for not doing the right thing.
Some basic blocks are glued together and must be placed next to each other. That situation can be recognized by "MBB->canFallThrough() && TII->AnalyzeBranch(MBB..)". Treat glued-together blocks as super-blocks, and everything should be as breezy as IR.
> Treat glued-together blocks as super-blocks, and everything should be as breezy as IR.

But that's just the thing -- a primary goal of this pass would be to *change* the fall-through pattern.
Also, it's still not clear to me how to analyze switches in CodeGen, but that's likely my lack of having read the appropriate interfaces thoroughly.
>> As for why it should be an IR pass: mostly because once the selection DAG runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires knowing about the terminators (which we only sometimes do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.
>
> Those are all the wrong reasons for not doing the right thing.

Sorry, I'm not trying to do the wrong thing because of this... Currently, it feels like a trade-off in terms of cost/benefit. It's not yet clear to me that the benefit of doing this analysis in the CodeGen layer outweighs the cost, and I was trying to clarify what the costs I perceive are.
On Oct 18, 2011, at 5:22 PM, Chandler Carruth wrote:
>>> As for why it should be an IR pass: mostly because once the selection DAG runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires knowing about the terminators (which we only sometimes do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.
>>
>> Those are all the wrong reasons for not doing the right thing.
>
> Sorry, I'm not trying to do the wrong thing because of this... Currently, it feels like a trade-off in terms of cost/benefit. It's not yet clear to me that the benefit of doing this analysis in the CodeGen layer outweighs the cost, and I was trying to clarify what the costs I perceive are.

I think it's mostly about understanding how MBBs work.
Ignoring calls and returns, most machines have three kinds of branches:

1. Unconditional
2. Conditional
3. Indirect

The AnalyzeBranch() function understands the first two kinds, so if that function returns false (as in it's false that it didn't succeed) you can move the successors around, and you know that placing a successor immediately after the block and calling updateTerminator() will give you a fall-through.

If AnalyzeBranch() fails, you can still check if the last instruction in the block is an unpredicated barrier. If so, it is still safe to move the successors around, but that block will never be a fall-through. The canFallThrough() function implements this check.

If the last instruction in the block is predicated or not a barrier, you must keep it together with its layout successor. This should only happen in rare cases where it is necessary. For example, I am planning to lower invoke instructions into call instructions that are terminators. This is necessary to accurately model control flow to landing pads. Such a call instruction must fall through to its layout successor.

Some experimental targets don't implement AnalyzeBranch, so everything looks like an indirect branch. Those targets get the code placement they deserve.

I am not claiming the API is awesome, but the information you need is there, and you have the same freedom as for IR.

We explicitly designed the branch weights so switch lowering could annotate all the new branches with exact weights. It would be a shame to ignore that information.

So the benefits are:

- Profile-driven fall-through layout of lowered switches. That should be a pretty big deal.
- Proper placement of split critical edges.
- The ability to implement stuff like: "Don't put too many branches in a fetch group, or you'll freak out the branch predictor".
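[Editor's note: the three cases above can be summarized as a small decision table. The sketch below is purely schematic -- the enum, struct, and function names are hypothetical stand-ins, not LLVM types; the real checks are TargetInstrInfo::AnalyzeBranch(), MachineBasicBlock::canFallThrough(), and updateTerminator().]

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; not LLVM types.
enum class Terminator { Analyzable, IndirectBarrier, PredicatedOrNonBarrier };

struct Placement {
  bool CanReorderSuccessors; // may the layout pass move successors around?
  bool CanFallThrough;       // can a successor placed next become a fall-through?
};

// Mirrors the three cases described in the message above.
Placement classify(Terminator T) {
  switch (T) {
  case Terminator::Analyzable:
    // AnalyzeBranch() succeeded (branch kinds 1 and 2): successors are
    // movable, and placing one immediately after the block followed by
    // updateTerminator() yields a fall-through.
    return {true, true};
  case Terminator::IndirectBarrier:
    // AnalyzeBranch() failed, but the last instruction is an unpredicated
    // barrier: successors are still movable, but never a fall-through.
    return {true, false};
  case Terminator::PredicatedOrNonBarrier:
    // Glued: the block must stay with its layout successor.
    return {false, true};
  }
  assert(false && "unreachable");
  return {false, false};
}
```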
Ok, wow that wasn't hard at all. This is still *very* much a rough draft, but it's probably better to review than the previous patch. One big caveat: I know I have an iteration bug in here somewhere that is inf-looping. Just ran out of steam debugging it; will pick it back up again later today to shake it out.
Still, for the test cases that don't tickle the iteration bug, it generates beautiful, well laid out code. =]
> Ok, wow that wasn't hard at all.
Awesome ;-)
> This is still *very* much a rough draft, but it's probably better to review than the previous patch. One big caveat: I know I have an iteration bug in here somewhere that is inf-looping. Just ran out of steam debugging it; will pick it back up again later today to shake it out.
Maybe splicing a block into its current position will create a loop?
Some random notes:
- Please add a description of the algorithm.
- Please add a comment to the BlockChain class.
- Use a separate anonymous namespace per class, and don't indent for the namespace.
+BlockChain *BlockPlacement2::CreateChain(MachineBasicBlock *BB) {
+ Chains.push_back(BlockChain(BlockToChain, BB));
+ BlockToChain[BB] = &Chains.back();
+ assert(ActiveChains.insert(&Chains.back()));
+ return &Chains.back();
+}
Whoa, you are storing pointers into a growing vector. You should at least assert(Chains.size() != Chains.capacity()) before pushing.
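[Editor's note: the hazard is that push_back() may reallocate, silently invalidating every pointer handed out so far. A minimal stand-alone sketch of the assert-plus-reserve pattern being suggested; BlockChain and the MaxChains bound are simplified stand-ins for the patch's actual types.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct BlockChain { int Id; };

std::vector<BlockChain> Chains;

BlockChain *createChain(int Id, std::size_t MaxChains) {
  // One-time allocation up front; afterwards push_back() never
  // reallocates, so pointers into the vector stay valid.
  if (Chains.capacity() < MaxChains)
    Chains.reserve(MaxChains);
  assert(Chains.size() != Chains.capacity() &&
         "growing past capacity would invalidate outstanding pointers");
  Chains.push_back(BlockChain{Id});
  return &Chains.back();
}
```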
+ RequiredChain->BBEnd = ++MachineFunction::iterator(From);
Does C++ allow this these days? I would prefer llvm::next(MachineFunction::iterator(From))
Note that MachineFunction::iterator decays into an MBB pointer, so you can say FI->canFallThrough() and AnalyzeBranch(*FI...)
+ WeightedEdge WE = { BaseFrequency * MBPI->getEdgeProbability(From, To),
Did we add API for this computation? We intended to add a function that computes this and saturates on overflow etc.
> Still, for the test cases that don't tickle the iteration bug, it generates beautiful, well laid out code. =]
I am sure others have strong opinions about the algorithm ;-)
I would like to see a more explicit handling of loop layout. There is a tradeoff between entering the loop with a branch, and having the exit block last. It is not clear to me that this algorithm handles that properly.
Be careful about placing too much trust in the behavior of branch probabilities. They go out of whack when they saturate.
Second, you should handle blocks that can never fall through (indirect branches from jump tables). There is no need to try to place their successors.
/jakob
I agree. The current CodePlacementOpt pass handles blocks layout within a loop. The new BB placement optimization should be integrated with it.
Evan
Overflow is handled transparently in the overloaded BlockFrequency::operator*(BranchProbability). But you could use the existing API anyway by adding a helper to MachineBlockFrequencyInfo calling BlockFrequencyImpl::getEdgeFreq(Src,Dst).
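[Editor's note: for readers following along, a rough stand-alone sketch of what "handled transparently" can mean -- the multiply uses a wide intermediate and the result saturates instead of wrapping. The real arithmetic lives in BlockFrequency::operator*(BranchProbability); this version is only illustrative and assumes the GCC/Clang __int128 extension.]

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Scale a 64-bit frequency by N/D. The naive Freq * N can overflow
// 64 bits, so do the multiply in 128 bits; if the quotient still does
// not fit (possible when N > D, e.g. a loop scale), saturate.
uint64_t scaleSaturating(uint64_t Freq, uint32_t N, uint32_t D) {
  assert(D != 0 && "zero denominator");
  unsigned __int128 Full = (unsigned __int128)Freq * N / D;
  const uint64_t Max = std::numeric_limits<uint64_t>::max();
  return Full > Max ? Max : (uint64_t)Full;
}
```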
>> Still, for the test cases that don't tickle the iteration bug, it generates beautiful, well laid out code. =]
>
> I am sure others have strong opinions about the algorithm ;-)
I may be one of those others ;-). Although handling SCCs (loops) probably saves Chandler's design from going too far off the rails. I don't remember that being in the published algorithm.
> I would like to see a more explicit handling of loop layout. There is a tradeoff between entering the loop with a branch, and having the exit block last. It is not clear to me that this algorithm handles that properly.
How do you ensure that BlockChains start at loop headers?
How do you ensure that loop exits are laid out following the loop body, especially in the presence of critical edges?
Why not layout blocks according to MachineLoopInfo? The SCC abstraction seems overly generic and unnecessary.
When you merge chains, why don't you delete the edge between the chains?
Why do you run profile guided block layout after the existing CodePlacementOpt? Shouldn't it be the other way around so that CodePlacementOpt can cleanup loops, which BlockPlacement2 doesn't handle well? I think the answer is that BlockPlacement2 actually depends on loops being laid out sensibly before running, but that needs to be explained.
>
> Be careful about placing too much trust in the behavior of branch probabilities. They go out of whack when they saturate.
>
> Second, you should handle blocks that can never fall through (indirect branches from jump tables). There is no need to try to place their successors.
Please put this under some flag for now. Enabling it by default opens up a whole set of issues that are premature to discuss. Until the overall code layout strategy is more coherent, people will only want to enable profile guided layout when they have complete profile info, or really need to take full advantage of builtin expect. If you're not one of those people, you'll want to leave this disabled to avoid unnecessary misery when debugging generated disassembly, and reduce the number of places that codegen makes semi-random decisions that perturb the code.
Otherwise, the code looks nice! Good comments.
-Andy
> A new patch is attached that is *much* less of a rough draft. Sorry for any confusion due to the early state of the patch.
Thanks, Chandler. This is great stuff.
> Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I'll be the first to say it's almost certainly not ready for that just yet.
I would like to see this go in ASAP under a flag. I prefer to see development as commits rather than a series of updated patches.
Could you rename it to MachineBlockPlacement.cpp first, though? That makes it clear it's a CodeGen pass, and the BlockPlacement2 name is icky.
> I'm more hopeful that we can delete the existing block placement pass, and direct anyone interested in profile-guided stuff to write a simple pass to load profiles into metadata. I suspect this pass is already superior to that one.
I also see it as a replacement for CodePlacementOpt.cpp, so I think your flag should run your pass instead of CodePlacementOpt rather than before or after it.
You should simply clone the loop alignment stuff, it's pretty trivial.
/jakob
On Thu, Oct 20, 2011 at 10:53 AM, Jakob Stoklund Olesen <stok...@2pi.dk> wrote:
> On Oct 20, 2011, at 9:56 AM, Chandler Carruth wrote:
> Thanks, Chandler. This is great stuff.
> A new patch is attached that is *much* less of a rough draft. Sorry for any confusion due to the early state of the patch.
> I would like to see this go in ASAP under a flag. I prefer to see development as commits rather than a series of updated patches.
> Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I'll be the first to say it's almost certainly not ready for that just yet.
Awesome, same way I feel. Checked in as r142641. I've got some generic cleanup I'll go ahead and commit to it over the next few days.
Done, and in progress. Should have the loop alignment stuff cloned right away, and then I'll start looking at the loop structure issues. I'll probably have some questions for you and/or Andy about exactly how that should work, whether CodePlacementOpt is doing the right thing, etc.
Once you decide to break these constraints, you have effectively designated the offending paths as "cold". It's up to you how to decide which paths are cold. You'll need some threshold based on confidence in the profile, and we can measure the effectiveness of your heuristics and add more over time.
I don't care much about how you lay out the cold chains as long as they are not interleaved with the rest of the code, thus breaking the topological ordering and forcing extra branches. In practice, this means they get their own little Siberia after the function return, effectively splitting the function into two sections. However, the notion of "coldness" is really relative to the current loop head.
2) Construct chains of blocks within the loop. Here the "entry" chain is the loop header, or function entry. You can probably represent each already laid out inner loop as a single chain. The graph of chains within a loop is *acyclic* except for irreducible control flow, which doesn't have any structure to preserve anyway. There's no need to compute SCCs.
4) Splice the chains together, preserving topological order. Assigning each block a DFS reverse-post-order number provides a nice ordering that is easy to compare or sort.
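[Editor's note: steps 2 and 4 can be sketched concretely -- number the blocks in DFS reverse post-order (RPO), then splice chains in order of their entry block's RPO number. The six-block CFG (blocks 0-5) and the chain assignment below are made up for illustration.]

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

using Block = int;
using Graph = std::map<Block, std::vector<Block>>;

// Depth-first post-order walk; reversing the result gives RPO.
void postOrder(const Graph &G, Block B, std::map<Block, bool> &Seen,
               std::vector<Block> &Out) {
  if (Seen[B]) return;
  Seen[B] = true;
  auto It = G.find(B);
  if (It != G.end())
    for (Block S : It->second)
      postOrder(G, S, Seen, Out);
  Out.push_back(B);
}

std::vector<Block> layout(const Graph &G, Block Entry,
                          std::vector<std::vector<Block>> Chains) {
  // Assign each block its reverse-post-order number.
  std::map<Block, bool> Seen;
  std::vector<Block> PO;
  postOrder(G, Entry, Seen, PO);
  std::map<Block, int> RPO;
  for (int I = (int)PO.size() - 1, N = 0; I >= 0; --I)
    RPO[PO[I]] = N++;
  // Splice: sort chains by the RPO number of their entry block, then
  // concatenate. This preserves topological order between chains.
  std::sort(Chains.begin(), Chains.end(),
            [&](const std::vector<Block> &A, const std::vector<Block> &B) {
              return RPO[A.front()] < RPO[B.front()];
            });
  std::vector<Block> Order;
  for (const auto &C : Chains)
    for (Block B : C)
      Order.push_back(B);
  return Order;
}
```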
Let me give you a little example to think about:

A: br B(80%), C(20%)
B: br D
C: br D(90%), E(10%)
D: br F
E: br F
F: ret

This is a fine topological sort but bad layout based on expected branch probability (even with low confidence in the profile). The only good layout that preserves topological ordering is: A, B, C, E, D, F
Finally, we need some way to validate the design and verify the implementation. Weighted statistics will be very useful here, similar to those that Jakob used in the register allocator. For example:

For each nonadjacent edge: TakenBranchFrequency += frequency(Edge)

This metric will prefer the most "extreme" layout because it assumes perfect accuracy of block frequencies. It would be nice to look at the same metric computed from skewed branch probabilities to see how sensitive it is to small shifts in branch behavior.

Taken branch frequency is the most obvious, but doesn't completely capture locality. You could also measure distance jumped fairly easily, also weighted by edge frequency.

It might be useful to gather these stats as late as possible to take into consideration the final impact of layout, including the effect on any later passes.
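[Editor's note: the proposed statistic is straightforward to compute from a layout and a table of edge frequencies. In this sketch the blocks 0-5 stand for A-F and the frequencies are invented (entry frequency 100, split 80/20 and then 90/10), purely to show the metric at work.]

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Block = int;
using Edge = std::pair<Block, Block>;

// Sum the frequency of every CFG edge whose target is not the
// fall-through (layout-adjacent) successor of its source. Lower is better.
uint64_t takenBranchFrequency(const std::map<Edge, uint64_t> &Freq,
                              const std::vector<Block> &Layout) {
  // Record each block's layout successor.
  std::map<Block, Block> Next;
  for (std::size_t I = 0; I + 1 < Layout.size(); ++I)
    Next[Layout[I]] = Layout[I + 1];
  uint64_t Total = 0;
  for (const auto &EF : Freq) {
    auto It = Next.find(EF.first.first);
    bool FallThrough = It != Next.end() && It->second == EF.first.second;
    if (!FallThrough)
      Total += EF.second; // a taken (non-fall-through) branch
  }
  return Total;
}
```

On this made-up data, the layout A,B,C,E,D,F scores 120 against 200 for the plain A,B,C,D,E,F order, matching the intuition that keeping hot edges adjacent lowers the metric.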
Ok, I think I have a working pass that is based much more on what we've talked about here. The patch is attached. I'd love to commit it soon-ish and then start tweaking it based on feedback from you, others, and looking at how it actually works in the wild.
> Let me give you a little example to think about:
>
> A: br B(80%), C(20%)
> B: br D
> C: br D(90%), E(10%)
> D: br F
> E: br F
> F: ret
>
> This is a fine topological sort but bad layout based on expected branch probability (even with low confidence in the profile). The only good layout that preserves topological ordering is: A, B, C, E, D, F

I made a test case out of this, and I think it works. It's a bit confusing, as it just replaces branches to F with ret. That removes the edge from D to F, and makes the following ordering also viable: A, B, C, D, E. That's the ordering it chooses, and it also duplicates D up into B instead of branching from B to D. :: shrug ::. It looks pretty good to me though. =]
> I don't care much about how you lay out the cold chains as long as they are not interleaved with the rest of the code, thus breaking the topological ordering and forcing extra branches. In practice, this means they get their own little Siberia after the function return, effectively splitting the function into two sections. However, the notion of "coldness" is really relative to the current loop head.

I've not really tried to ensure this happens correctly. Mostly, this is just forming hot-paths through the code when they are *really* hot. I'll work on better sectioning off the cold paths that are cut off in a followup patch.