Can you please elaborate on the following statement?
> "I've replicated its functionality with a much updated transformation algorithm (no longer O(N^2), and finds better orderings!)"
Is this a new algorithm, or something existing you can refer to? Thanks.
Sergei
Hello,
Thoughts?
_______________________________________________
LLVM Developers mailing list
LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> Hello,
>
> I'm working on basic-block placement optimizations based on branch probability information. I've run into a stumbling block though. One of the existing passes to do this, an essentially dead pass called 'block-placement', operates on the IR, reordering the blocks of the function and relying on the code generator to preserve that order unless it has a reason to rearrange it. This pass is entirely untested AFAICT, and it doesn't work any more...
>
> That said, I've replicated its functionality with a much updated transformation algorithm (no longer O(N^2), and finds better orderings!) and predicated it on the new frequency and probability information. It "works" great, but it doesn't actually optimize anything. It re-arranges all the blocks of the IR function, laying them out perfectly. And then the code generation layer throws that away and seems to establish its own ordering altogether.
>
> What's the best approach here? Is ignoring the order of blocks in the IR function desired behavior? (It does seem reasonable, it just seems to have not always been that way, so I wondered.)
>
> Should I sink my pass down to a codegen pass?
I think this should really live as a CodeGen pass. Is there any good reason to make it an IR pass?
Cameron
> As for why it should be an IR pass: mostly because once the selection DAG runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires knowing about the terminators (which we only sometimes do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.
We should be able to analyze more terminators in the backend. Most other compilers have this ability. If we don't fix this problem now, we will only increase the distance between the functionality of the backend IR and the mid-level IR.
> Operating at the IR level makes writing the pass a breeze, and we can quickly and efficiently simply lay the blocks out in the ideal ordering. Then the SelectionDAG and other layers can preserve this modulo folding and other optimization opportunities.
The biggest problem with doing block layout at the IR level is that you don't have the final CFG. Lots of passes can modify the CFG, and they will have to rely on the sorts of local heuristics that already exist (and are known to perform poorly in some cases) to redo block layout after these changes. You also can't represent jump fall-throughs in the mid-level IR (nor should you be able to).
On Tue, Oct 18, 2011 at 2:59 PM, Cameron Zwarich <zwa...@apple.com> wrote:
> I think this should really live as a CodeGen pass. Is there any good reason to make it an IR pass?

So, as it happens, I was *completely* wrong here. CodeGen correctly preserves the ordering of blocks from IR, *unless* it can do folding, etc.
As for why it should be an IR pass: mostly because once the selection DAG runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires knowing about the terminators (which we only sometimes do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.
On Oct 18, 2011, at 3:07 PM, Chandler Carruth wrote:
> On Tue, Oct 18, 2011 at 2:59 PM, Cameron Zwarich <zwa...@apple.com> wrote:
>> I think this should really live as a CodeGen pass. Is there any good reason to make it an IR pass?
>
> So, as it happens, I was *completely* wrong here. CodeGen correctly preserves the ordering of blocks from IR, *unless* it can do folding, etc.

That's right. However, the CFG changes quite a bit during CodeGen. Switches can be lowered into branch trees, multiple passes can split critical edges, and then there is taildup and tailmerge. An IR code layout algorithm simply doesn't know the final CFG.
> As for why it should be an IR pass: mostly because once the selection DAG runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires knowing about the terminators (which we only sometimes do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.

Those are all the wrong reasons for not doing the right thing.
Some basic blocks are glued together and must be placed next to each other. That situation can be recognized by "MBB->canFallThrough() && TII->AnalyzeBranch(MBB..)". Treat glued-together blocks as super-blocks, and everything should be as breezy as IR.
> Treat glued-together blocks as super-blocks, and everything should be as breezy as IR.

But that's just the thing -- a primary goal of this pass would be to *change* the fall-through pattern.
Also, it's still not clear to me how to analyze switches in CodeGen, but that's likely my lack of having read the appropriate interfaces thoroughly.
>> As for why it should be an IR pass: mostly because once the selection DAG runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires knowing about the terminators (which we only sometimes do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.
>
> Those are all the wrong reasons for not doing the right thing.

Sorry, I'm not trying to do the wrong thing because of this... Currently, it feels like a trade-off in terms of cost/benefit. It's not yet clear to me that the benefit of doing this analysis in the CodeGen layer outweighs the cost, and I was trying to clarify what the costs I perceive are.
On Oct 18, 2011, at 5:22 PM, Chandler Carruth wrote:
>>> As for why it should be an IR pass: mostly because once the selection DAG runs through the code, we can never recover all of the freedom we have at the IR level. To start with, splicing MBBs around requires knowing about the terminators (which we only sometimes do), and it requires re-writing them a touch to account for the different fall-through pattern. To make matters worse, at this point we don't have the nicely analyzable 'switch' terminator (I think), and so the existing MBB placement code just bails on non-branch-exit blocks.
>>
>> Those are all the wrong reasons for not doing the right thing.
>
> Sorry, I'm not trying to do the wrong thing because of this... Currently, it feels like a trade-off in terms of cost/benefit. It's not yet clear to me that the benefit of doing this analysis in the CodeGen layer outweighs the cost, and I was trying to clarify what the costs I perceive are.

I think it's mostly about understanding how MBBs work.
Ignoring calls and returns, most machines have three kinds of branches:

1. Unconditional
2. Conditional
3. Indirect

The AnalyzeBranch() function understands the first two kinds, so if that function returns false (as in it's false that it didn't succeed) you can move the successors around, and you know that placing a successor immediately after the block and calling updateTerminator() will give you a fall-through.

If AnalyzeBranch() fails, you can still check if the last instruction in the block is an unpredicated barrier. If so, it is still safe to move the successors around, but that block will never be a fall-through. The canFallThrough() function implements this check.

If the last instruction in the block is predicated or not a barrier, you must keep it together with its layout successor. This should only happen in rare cases where it is necessary. For example, I am planning to lower invoke instructions into call instructions that are terminators. This is necessary to accurately model control flow to landing pads. Such a call instruction must fall through to its layout successor.

Some experimental targets don't implement AnalyzeBranch, so everything looks like an indirect branch. Those targets get the code placement they deserve.

I am not claiming the API is awesome, but the information you need is there, and you have the same freedom as for IR.

We explicitly designed the branch weights so switch lowering could annotate all the new branches with exact weights. It would be a shame to ignore that information.

So the benefits are:

- Profile-driven fall-through layout of lowered switches. That should be a pretty big deal.
- Proper placement of split critical edges.
- The ability to implement stuff like: "Don't put too many branches in a fetch group, or you'll freak out the branch predictor".
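[Editor's note: the three cases above can be summarized as a small decision table. The sketch below is purely schematic -- the enum, struct, and function names are hypothetical stand-ins, not LLVM types; the real checks are TargetInstrInfo::AnalyzeBranch(), MachineBasicBlock::canFallThrough(), and updateTerminator().]

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; not LLVM types.
enum class Terminator { Analyzable, IndirectBarrier, PredicatedOrNonBarrier };

struct Placement {
  bool CanReorderSuccessors; // may the layout pass move successors around?
  bool CanFallThrough;       // can a successor placed next become a fall-through?
};

// Mirrors the three cases described in the message above.
Placement classify(Terminator T) {
  switch (T) {
  case Terminator::Analyzable:
    // AnalyzeBranch() succeeded (branch kinds 1 and 2): successors are
    // movable, and placing one immediately after the block followed by
    // updateTerminator() yields a fall-through.
    return {true, true};
  case Terminator::IndirectBarrier:
    // AnalyzeBranch() failed, but the last instruction is an unpredicated
    // barrier: successors are still movable, but never a fall-through.
    return {true, false};
  case Terminator::PredicatedOrNonBarrier:
    // Glued: the block must stay with its layout successor.
    return {false, true};
  }
  assert(false && "unreachable");
  return {false, false};
}
```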
Ok, wow that wasn't hard at all. This is still *very* much a rough draft, but it's probably better to review than the previous patch. One big caveat: I know I have an iteration bug in here somewhere that is inf-looping. Just ran out of steam debugging it; will pick it back up again later today to shake it out.
Still, for the test cases that don't tickle the iteration bug, it generates beautiful, well laid out code. =]
> Ok, wow that wasn't hard at all.
Awesome ;-)
> This is still *very* much a rough draft, but it's probably better to review than the previous patch. One big caveat: I know I have an iteration bug in here somewhere that is inf-looping. Just ran out of steam debugging it; will pick it back up again later today to shake it out.
Maybe splicing a block into its current position will create a loop?
Some random notes:
- Please add a description of the algorithm.
- Please add a comment to the BlockChain class.
- Use a separate anonymous namespace per class, and don't indent for the namespace.
+BlockChain *BlockPlacement2::CreateChain(MachineBasicBlock *BB) {
+ Chains.push_back(BlockChain(BlockToChain, BB));
+ BlockToChain[BB] = &Chains.back();
+ assert(ActiveChains.insert(&Chains.back()));
+ return &Chains.back();
+}
Whoa, you are storing pointers into a growing vector. You should at least assert(Chains.size() != Chains.capacity()) before pushing.
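[Editor's note: the hazard is that push_back() may reallocate, silently invalidating every pointer handed out so far. A minimal stand-alone sketch of the assert-plus-reserve pattern being suggested; BlockChain and the MaxChains bound are simplified stand-ins for the patch's actual types.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct BlockChain { int Id; };

std::vector<BlockChain> Chains;

BlockChain *createChain(int Id, std::size_t MaxChains) {
  // One-time allocation up front; afterwards push_back() never
  // reallocates, so pointers into the vector stay valid.
  if (Chains.capacity() < MaxChains)
    Chains.reserve(MaxChains);
  assert(Chains.size() != Chains.capacity() &&
         "growing past capacity would invalidate outstanding pointers");
  Chains.push_back(BlockChain{Id});
  return &Chains.back();
}
```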
+ RequiredChain->BBEnd = ++MachineFunction::iterator(From);
Does C++ allow this these days? I would prefer llvm::next(MachineFunction::iterator(From))
Note that MachineFunction::iterator decays into an MBB pointer, so you can say FI->canFallThrough() and AnalyzeBranch(*FI...)
+ WeightedEdge WE = { BaseFrequency * MBPI->getEdgeProbability(From, To),
Did we add API for this computation? We intended to add a function that computes this and saturates on overflow etc.
> Still, for the test cases that don't tickle the iteration bug, it generates beautiful, well laid out code. =]
I am sure others have strong opinions about the algorithm ;-)
I would like to see a more explicit handling of loop layout. There is a tradeoff between entering the loop with a branch, and having the exit block last. It is not clear to me that this algorithm handles that properly.
Be careful about placing too much trust in the behavior of branch probabilities. They go out of whack when they saturate.
Second, you should handle blocks that can never fall through (indirect branches from jump tables). There is no need to try to place their successors.
/jakob
I agree. The current CodePlacementOpt pass handles blocks layout within a loop. The new BB placement optimization should be integrated with it.
Evan
Overflow is handled transparently in the overloaded BlockFrequency::operator*(BranchProbability). But you could use the existing API anyway by adding a helper to MachineBlockFrequencyInfo calling BlockFrequencyImpl::getEdgeFreq(Src,Dst).
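[Editor's note: for readers following along, a rough stand-alone sketch of what "handled transparently" can mean -- the multiply uses a wide intermediate and the result saturates instead of wrapping. The real arithmetic lives in BlockFrequency::operator*(BranchProbability); this version is only illustrative and assumes the GCC/Clang __int128 extension.]

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Scale a 64-bit frequency by N/D. The naive Freq * N can overflow
// 64 bits, so do the multiply in 128 bits; if the quotient still does
// not fit (possible when N > D, e.g. a loop scale), saturate.
uint64_t scaleSaturating(uint64_t Freq, uint32_t N, uint32_t D) {
  assert(D != 0 && "zero denominator");
  unsigned __int128 Full = (unsigned __int128)Freq * N / D;
  const uint64_t Max = std::numeric_limits<uint64_t>::max();
  return Full > Max ? Max : (uint64_t)Full;
}
```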
>> Still, for the test cases that don't tickle the iteration bug, it generates beautiful, well laid out code. =]
>
> I am sure others have strong opinions about the algorithm ;-)
I may be one of those others ;-). Although handling SCCs (loops) probably saves Chandler's design from going too far off the rails. I don't remember that being in the published algorithm.
> I would like to see a more explicit handling of loop layout. There is a tradeoff between entering the loop with a branch, and having the exit block last. It is not clear to me that this algorithm handles that properly.
How do you ensure that BlockChains start at loop headers?
How do you ensure that loop exits are laid out following the loop body, especially in the presence of critical edges?
Why not layout blocks according to MachineLoopInfo? The SCC abstraction seems overly generic and unnecessary.
When you merge chains, why don't you delete the edge between the chains?
Why do you run profile guided block layout after the existing CodePlacementOpt? Shouldn't it be the other way around so that CodePlacementOpt can cleanup loops, which BlockPlacement2 doesn't handle well? I think the answer is that BlockPlacement2 actually depends on loops being laid out sensibly before running, but that needs to be explained.
>
> Be careful about placing too much trust in the behavior of branch probabilities. They go out of whack when they saturate.
>
> Second, you should handle blocks that can never fall through (indirect branches from jump tables). There is no need to try to place their successors.
Please put this under some flag for now. Enabling it by default opens up a whole set of issues that are premature to discuss. Until the overall code layout strategy is more coherent, people will only want to enable profile guided layout when they have complete profile info, or really need to take full advantage of builtin expect. If you're not one of those people, you'll want to leave this disabled to avoid unnecessary misery when debugging generated disassembly, and reduce the number of places that codegen makes semi-random decisions that perturb the code.
Otherwise, the code looks nice! Good comments.
-Andy
> A new patch is attached that is *much* less of a rough draft. Sorry for any confusion due to the early state of the patch.
Thanks, Chandler. This is great stuff.
> Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I'll be the first to say it's almost certainly not ready for that just yet.
I would like to see this go in ASAP under a flag. I prefer to see development as commits rather than a series of updated patches.
Could you rename it to MachineBlockPlacement.cpp first, though? That makes it clear it's a CodeGen pass, and the BlockPlacement2 name is icky.
> I'm more hopeful that we can delete the existing block placement pass, and direct anyone interested in profile-guided stuff to write a simple pass to load profiles into metadata. I suspect this pass is already superior to that one.
I also see it as a replacement for CodePlacementOpt.cpp, so I think your flag should run your pass instead of CodePlacementOpt rather than before or after it.
You should simply clone the loop alignment stuff, it's pretty trivial.
/jakob
On Thu, Oct 20, 2011 at 10:53 AM, Jakob Stoklund Olesen <stok...@2pi.dk> wrote:
> On Oct 20, 2011, at 9:56 AM, Chandler Carruth wrote:
> Thanks, Chandler. This is great stuff.
> A new patch is attached that is *much* less of a rough draft. Sorry for any confusion due to the early state of the patch.
> I would like to see this go in ASAP under a flag. I prefer to see development as commits rather than a series of updated patches.
> Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I'll be the first to say it's almost certainly not ready for that just yet.
Awesome, same way I feel. Checked in as r142641. I've got some generic cleanup I'll go ahead and commit to it over the next few days.
Done, and in progress. Should have the loop alignment stuff cloned right away, and then I'll start looking at the loop structure issues. I'll probably have some questions for you and/or Andy about exactly how that should work, whether CodePlacementOpt is doing the right thing, etc.
Once you decide to break these constraints, you have effectively designated the offending paths as "cold". It's up to you how to decide which paths are cold. You'll need some threshold based on confidence in the profile, and we can measure the effectiveness of your heuristics and add more over time.
I don't care much about how you lay out the cold chains as long as they are not interleaved with the rest of the code, thus breaking the topological ordering and forcing extra branches. In practice, this means they get their own little Siberia after the function return, effectively splitting the function into two sections. However, the notion of "coldness" is really relative to the current loop head.
2) Construct chains of blocks within the loop. Here the "entry" chain is the loop header, or function entry. You can probably represent each already laid out inner loop as a single chain. The graph of chains within a loop is *acyclic* except for irreducible control flow, which doesn't have any structure to preserve anyway. There's no need to compute SCCs.
4) Splice the chains together, preserving topological order. Assigning each block a DFS reverse-post-order number provides a nice ordering that is easy to compare or sort.
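[Editor's note: steps 2 and 4 can be sketched concretely -- number the blocks in DFS reverse post-order (RPO), then splice chains in order of their entry block's RPO number. The six-block CFG (blocks 0-5) and the chain assignment below are made up for illustration.]

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

using Block = int;
using Graph = std::map<Block, std::vector<Block>>;

// Depth-first post-order walk; reversing the result gives RPO.
void postOrder(const Graph &G, Block B, std::map<Block, bool> &Seen,
               std::vector<Block> &Out) {
  if (Seen[B]) return;
  Seen[B] = true;
  auto It = G.find(B);
  if (It != G.end())
    for (Block S : It->second)
      postOrder(G, S, Seen, Out);
  Out.push_back(B);
}

std::vector<Block> layout(const Graph &G, Block Entry,
                          std::vector<std::vector<Block>> Chains) {
  // Assign each block its reverse-post-order number.
  std::map<Block, bool> Seen;
  std::vector<Block> PO;
  postOrder(G, Entry, Seen, PO);
  std::map<Block, int> RPO;
  for (int I = (int)PO.size() - 1, N = 0; I >= 0; --I)
    RPO[PO[I]] = N++;
  // Splice: sort chains by the RPO number of their entry block, then
  // concatenate. This preserves topological order between chains.
  std::sort(Chains.begin(), Chains.end(),
            [&](const std::vector<Block> &A, const std::vector<Block> &B) {
              return RPO[A.front()] < RPO[B.front()];
            });
  std::vector<Block> Order;
  for (const auto &C : Chains)
    for (Block B : C)
      Order.push_back(B);
  return Order;
}
```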
Let me give you a little example to think about:

A: br B(80%), C(20%)
B: br D
C: br D(90%), E(10%)
D: br F
E: br F
F: ret

This is a fine topological sort but bad layout based on expected branch probability (even with low confidence in the profile). The only good layout that preserves topological ordering is: A, B, C, E, D, F
Finally, we need some way to validate the design and verify the implementation. Weighted statistics will be very useful here, similar to those that Jakob used in the register allocator. For example:

For each nonadjacent edge: TakenBranchFrequency += frequency(Edge)

This metric will prefer the most "extreme" layout because it assumes perfect accuracy of block frequencies. It would be nice to look at the same metric computed from skewed branch probabilities to see how sensitive it is to small shifts in branch behavior.

Taken branch frequency is the most obvious, but doesn't completely capture locality. You could also measure distance jumped fairly easily, also weighted by edge frequency.

It might be useful to gather these stats as late as possible to take into consideration the final impact of layout, including the effect on any later passes.
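[Editor's note: the proposed statistic is straightforward to compute from a layout and a table of edge frequencies. In this sketch the blocks 0-5 stand for A-F and the frequencies are invented (entry frequency 100, split 80/20 and then 90/10), purely to show the metric at work.]

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Block = int;
using Edge = std::pair<Block, Block>;

// Sum the frequency of every CFG edge whose target is not the
// fall-through (layout-adjacent) successor of its source. Lower is better.
uint64_t takenBranchFrequency(const std::map<Edge, uint64_t> &Freq,
                              const std::vector<Block> &Layout) {
  // Record each block's layout successor.
  std::map<Block, Block> Next;
  for (std::size_t I = 0; I + 1 < Layout.size(); ++I)
    Next[Layout[I]] = Layout[I + 1];
  uint64_t Total = 0;
  for (const auto &EF : Freq) {
    auto It = Next.find(EF.first.first);
    bool FallThrough = It != Next.end() && It->second == EF.first.second;
    if (!FallThrough)
      Total += EF.second; // a taken (non-fall-through) branch
  }
  return Total;
}
```

On this made-up data, the layout A,B,C,E,D,F scores 120 against 200 for the plain A,B,C,D,E,F order, matching the intuition that keeping hot edges adjacent lowers the metric.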
Ok, I think I have a working pass that is based much more on what we've talked about here. The patch is attached. I'd love to commit it soon-ish and then start tweaking it based on feedback from you, others, and looking at how it actually works in the wild.
> Let me give you a little example to think about:
>
> A: br B(80%), C(20%)
> B: br D
> C: br D(90%), E(10%)
> D: br F
> E: br F
> F: ret
>
> This is a fine topological sort but bad layout based on expected branch probability (even with low confidence in the profile). The only good layout that preserves topological ordering is: A, B, C, E, D, F

I made a test case out of this, and I think it works. It's a bit confusing, as it just replaces branches to F with ret. That removes the edge from D to F, and makes the following ordering also viable: A, B, C, D, E. That's the ordering it chooses, and it also duplicates D up into B instead of branching from B to D. :: shrug ::. It looks pretty good to me though. =]
> I don't care much about how you lay out the cold chains as long as they are not interleaved with the rest of the code, thus breaking the topological ordering and forcing extra branches. In practice, this means they get their own little Siberia after the function return, effectively splitting the function into two sections. However, the notion of "coldness" is really relative to the current loop head.

I've not really tried to ensure this happens correctly. Mostly, this is just forming hot-paths through the code when they are *really* hot. I'll work on better sectioning off the cold paths that are cut off in a followup patch.