[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Sriraman Tallam via llvm-dev

unread,

Sep 24, 2019, 7:52:08 PM9/24/19

to llvm...@lists.llvm.org

Greetings,

We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer
framework, on large google benchmarks and noticed that it improves key
performance metrics of these benchmarks by 2% to 6%, which is pretty impressive
as this is over and above a baseline binaryalready heavily optimized with
ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of
binaries optimized via Context-Sensitive PGO. While ThinLTO + PGO is also
profile guided and does very aggressive performance optimizations, there is
more room for performance improvements due to profile approximations while
applying the transformations. BOLT uses exact profiles from the final binary
and is able to fill the gaps left by ThinLTO + PGO. The performance
improvements due to BOLT come from basic block layout, function reordering and
function splitting.

While BOLT does an excellent job of squeezing extra performance from highly
optimized binaries with optimizations such as code layout, it has these major
issues:

* It does not take advantage of distributed build systems.
* It has scalability issues and to rewrite a binary with a ~300M text segment
size:
* Memory foot-print is 70G.
* It takes more than 10 minutes to rewrite the binary.

Similar to Full LTO, BOLT’s design is monolithic as it disassembles the
original binary, optimizes and rewrites the final binary in one process. This
limits the scalability of BOLT and the memory and time overhead shoots up
quickly for large binaries.

Inspired by the performance gains and to address the scalability issue of BOLT,
we went about designing a scalable infrastructure that can perform BOLT-like
post-link optimizations. In this RFC, we discuss our system, “Propeller”,
which can perform profile guided link time binary optimizations in a scalable
way and is friendly to distributed build systems. Our system leverages the
existing capabilities of the compiler tool-chain and is not a stand alone tool.
Like BOLT, our system boosts the performance of optimized binaries via
link-time optimizations using accurate profiles of the binary. We discuss the
Propeller system and show how to do the whole program basic block layout using
Propeller.

Propeller does whole program basic block layout at link time via basic block
sections. We have added support for having each basic block in its own section
which allows the linker to do arbitrary reorderings of basic blocks to achieve
any desired fine-grain code layout which includes block layout, function
splitting and function reordering. Our experiments on large real-world
applications and SPEC with code layout show that Propeller can optimize as
effectively as BOLT, with just 20% of its memory footprint and time overhead.

An LLVM branch with propeller patches is available in the git repository here:
https://github.com/google/llvm-propeller/ We will upload individual patches of
the various elements for review. We have attached a google doc describing the
Propeller system with Experiments in detail,
https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf

Quick Start - How to optimize further with Propeller?

# git clone and build repo

$ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git

$ mkdir $BUILD_DIR && cd $BUILD_DIR

$ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \
$LLVM_DIR/llvm-propeller/llvm

$ ninja -j25

$ export PATH=$BUILD_DIR/bin:$PATH

# Let’s Propeller-optimize the following program:

# Step 1: Build the peak optimized binary with an additional flag.

$ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld

# Step 2: Profile the binary, only one side of the branch is executed.
$ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >& /dev/null

# Step 3: Convert the profiles using the tool provided
$ $LLVM_DIR/llvm-propeller/create_llvm_prof --format=propeller \
--binary=./a.out.labels --profile=perf.data --out=perf.propeller

# Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag changed.
$ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld

In Step 4, the optimized bit code can be used if it is saved in Step1 as
Propeller is active only during compile backend and link. The optimized binary
has a different layout of the basic blocks in main to keep the executed blocks
together and split the cold blocks.

Some of the key points:

+ Added support for basic block sections, similar to function sections, where
each basic block can reside in its own section.

+ Jump instructions need 32-bit relocations and subsequent linker relaxations
after basic block layout. We would like to add a new relocation type for jump
instructions to make it easier for relaxations and guarantee correctness.

+ Added support in the linker to read profiles (PMU LBR) and discover optimal
basic block layout using the Ex-TSP algorithm described here:
https://arxiv.org/abs/1809.04676

+ Added support in the linker to re-order basic block sections in any
specified sequence (can be done with symbol ordering file). This requires
linker relaxations to delete and shrink branches across basic blocks.

+ Compared our system to BOLT and have shown that our system can produce
similar performance improvements as BOLT but with much less memory and time
overheads. Our experiments are on very large warehouse-scale benchmarks and
SPEC 2017.

+ Listed why this cannot be done as part of PGO itself. Post Link
optimizations are able to transform the binary using accurate profiles and PGO
passes suffer from profile imprecision.

+ Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks
via basic block sections.

+ Discussed CFI handling and is sub-optimal and bloats object file sizes and
binary sizes much more than DebugInfo due to lack of support for discontiguous
address ranges. We have talked about this and would like to make a case to
support discontiguous ranges with CFI which will make basic block sections much
more cheaper.

Detailed RFC document here :
https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf

Please let us know what you think,
Thanks
Sri on behalf of the team.
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Mehdi AMINI via llvm-dev

unread,

Sep 25, 2019, 4:36:44 AM9/25/19

to Sriraman Tallam, llvm-dev

Hi Sriraman,

This is an impressive piece of work! The results look really good, and the document you provided is very thorough. Looking forward to the patches :)

Best,

--

Mehdi

Michael Spencer via llvm-dev

unread,

Sep 25, 2019, 6:44:40 PM9/25/19

to Sriraman Tallam, LLVMDev

On Tue, Sep 24, 2019 at 4:52 PM Sriraman Tallam via llvm-dev <llvm...@lists.llvm.org> wrote:

The function ordering parts of this look like they are doing the same thing as lld's existing function reordering code, just with a different profile source and different ordering algorithm. Have you compared the performance of just the function reordering bits to what we already have (https://github.com/google/llvm-propeller/blob/plo-dev/lld/ELF/CallGraphSort.cpp)?

I would expect that whichever ordering algorithm is best would be the best for both profile sources, so I would recommend looking at integrating the function reordering code with the existing code instead of adding a separate one.

- Michael Spencer

Eli Friedman via llvm-dev

unread,

Sep 25, 2019, 8:02:14 PM9/25/19

to Sriraman Tallam, llvm-dev

My biggest question about this architecture is about when propeller runs basic block reordering within a function. It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generated mangled ELF, and then re-parsing the mangled ELF as a separate step. I'm not sure that's the right approach, long-term.

Splitting every basic block into its own section introduces overhead, like you described. And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches. And the reordering itself involves a bunch of x86 and ELF-specific code.

I'd like to suggest an alternative: instead of perform basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer. MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions. And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards. This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows.

There are some minor downsides to this approach I can think of. You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description? It might take more CPU time to re-do LTO code generation, but that's probably not a big deal for applications where the user is willing to collect two different kinds PGO of profiles for a program.

I understand you can't use a regular PGO profile for this, but it should be possible to pass in a Propeller-PGO profile separately to the compiler, and apply that to the MIR as a late pass in the compiler, separate from any regular PGO profile. There are probably multiple ways you could do this; you could use basic block symbols, like your current implementation.

-Eli

Sriraman Tallam via llvm-dev

unread,

Sep 25, 2019, 8:19:15 PM9/25/19

to Eli Friedman, llvm-dev

On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman <efri...@quicinc.com> wrote:
>
> My biggest question about this architecture is about when propeller runs basic block reordering within a function. It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generated mangled ELF, and then re-parsing the mangled ELF as a separate step. I'm not sure that's the right approach, long-term.
>
> Splitting every basic block into its own section introduces overhead, like you described. And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches. And the reordering itself involves a bunch of x86 and ELF-specific code.
>
> I'd like to suggest an alternative: instead of perform basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer. MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions. And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards. This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows.
>
> There are some minor downsides to this approach I can think of. You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description?

I have mentioned this in the RFC under alternate design in the
Experments. We did consider this approach. This is our first attempt
at doing this end to end. We plan to do much more. We are looking at
much more aggressive inter-procedural code layout optimizations based
on path profiles and machine learning, which requires having support
for basic block sections. Being able to do inter-procedural block
layout is already showing more performance improvements over what we
have right now. We don't think basic block sections is expensive if
done on demand and there are opportunities for performance exploration
with this support.

Xinliang David Li via llvm-dev

unread,

Sep 25, 2019, 8:58:45 PM9/25/19

to Eli Friedman, llvm-dev

On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <llvm...@lists.llvm.org> wrote:

My biggest question about this architecture is about when propeller runs basic block reordering within a function. It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generated mangled ELF, and then re-parsing the mangled ELF as a separate step. I'm not sure that's the right approach, long-term.

Splitting every basic block into its own section introduces overhead, like you described. And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches. And the reordering itself involves a bunch of x86 and ELF-specific code.

I'd like to suggest an alternative: instead of perform basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer. MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions.

This was considered for Propeller. This is currently being explored in a similar way as an alternative of CSFDO which uses PMU samples.

And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards. This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows.

There are some minor downsides to this approach I can think of. You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description?

One of the main design objectives of Propeller is to have the capability to do interprocedural code transformations (reordering, cloning, dedupping etc), so this won't be a minor downside. Function/block alignment (for branch misprediction reduction etc) will also need to be done as a global optimization in the future.

David

Sriraman Tallam via llvm-dev

unread,

Sep 25, 2019, 9:14:14 PM9/25/19

to Michael Spencer, LLVMDev

Long term, we are looking at inter-procedural basic block layout,
basic block cloning etc. and this encapsulates block layout, function
splitting and function reordering. We will look at how we could
integrate this with existing callgraph layout algorithm.

>
> - Michael Spencer

Sriraman Tallam via llvm-dev

unread,

Sep 26, 2019, 3:03:50 PM9/26/19

to Michael Spencer, LLVMDev

On Wed, Sep 25, 2019 at 3:44 PM Michael Spencer <bigch...@gmail.com> wrote:
>

We looked at the code and it looks like the function reordering parts
are using the hfsort algorithm, both CallgraphSort and Propeller.
So, we can merge these. When callgraphsort option is passed,
Propeller will invoke the hfsort code to do function layout. We
prefer doing this as a follow-up patch if that is alright as this
requires more plumbing.

>
> I would expect that whichever ordering algorithm is best would be the best for both profile sources, so I would recommend looking at integrating the function reordering code with the existing code instead of adding a separate one.
>
> - Michael Spencer

Eli Friedman via llvm-dev

unread,

Sep 26, 2019, 3:39:05 PM9/26/19

to Xinliang David Li, llvm-dev

From: Xinliang David Li <xinli...@gmail.com>
Sent: Wednesday, September 25, 2019 5:58 PM
To: Eli Friedman <efri...@quicinc.com>
Cc: Sriraman Tallam <tmsr...@google.com>; llvm-dev <llvm...@lists.llvm.org>
Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <llvm...@lists.llvm.org> wrote:

My biggest question about this architecture is about when propeller runs basic block reordering within a function. It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generated mangled ELF, and then re-parsing the mangled ELF as a separate step. I'm not sure that's the right approach, long-term.

Splitting every basic block into its own section introduces overhead, like you described. And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches. And the reordering itself involves a bunch of x86 and ELF-specific code.

I'd like to suggest an alternative: instead of perform basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer. MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions.

This was considered for Propeller. This is currently being explored in a similar way as an alternative of CSFDO which uses PMU samples.

Makes sense.

And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards. This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows.

There are some minor downsides to this approach I can think of. You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description?

One of the main design objectives of Propeller is to have the capability to do interprocedural code transformations (reordering, cloning, dedupping etc), so this won't be a minor downside. Function/block alignment (for branch misprediction reduction etc) will also need to be done as a global optimization in the future.

Okay, so my suggestion doesn’t work. I’m still concerned the proposed design is going to push us in a direction we don’t want to go. Particularly, if you’re going to attempt more complicated transforms, the problems caused by the limited information available in an ELF file will become more prominent. I mean, yes, you can come up with a more complicated implicit contract between the compiler and Propeller about the exact format of Propeller basic blocks, and add additional symbol annotations, and eventually come up with an “IR” that allows Propeller to perform arbitrary code transforms. But that’s inevitably going to be more complicated, and harder to understand, than a compiler IR designed for optimizations.

If we are going to stick with ELF-as-compiler-IR, I’d like to see more careful documentation of the contract between the compiler and the Propeller linker, so it’s clear what the compiler is/is not allowed to emit at compile-time, and so other compilers could be made compatible in the future. The current Propeller implementation is doing some code rewrites that can’t be justified by just standard ELF section reordering.

-Eli

Sriraman Tallam via llvm-dev

unread,

Sep 26, 2019, 6:24:59 PM9/26/19

to Eli Friedman, llvm-dev

On Thu, Sep 26, 2019 at 12:39 PM Eli Friedman <efri...@quicinc.com> wrote:
>
>
>
> From: Xinliang David Li <xinli...@gmail.com>
> Sent: Wednesday, September 25, 2019 5:58 PM
> To: Eli Friedman <efri...@quicinc.com>
> Cc: Sriraman Tallam <tmsr...@google.com>; llvm-dev <llvm...@lists.llvm.org>
> Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations
>
>
>
>
>
>
>
> On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> My biggest question about this architecture is about when propeller runs basic block reordering within a function. It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generated mangled ELF, and then re-parsing the mangled ELF as a separate step. I'm not sure that's the right approach, long-term.
>
> Splitting every basic block into its own section introduces overhead, like you described. And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches. And the reordering itself involves a bunch of x86 and ELF-specific code.
>
> I'd like to suggest an alternative: instead of perform basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer. MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions.
>
>
>
> This was considered for Propeller. This is currently being explored in a similar way as an alternative of CSFDO which uses PMU samples.
>
>
>
> Makes sense.
>
>
>
> And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards. This should take less code overall, and much less target-specific code. And infrastructure for function splitting would be useful for non-Propeller workflows.
>
> There are some minor downsides to this approach I can think of. You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description?
>
>
>
> One of the main design objectives of Propeller is to have the capability to do interprocedural code transformations (reordering, cloning, dedupping etc), so this won't be a minor downside. Function/block alignment (for branch misprediction reduction etc) will also need to be done as a global optimization in the future.
>
>
>
> Okay, so my suggestion doesn’t work. I’m still concerned the proposed design is going to push us in a direction we don’t want to go. Particularly, if you’re going to attempt more complicated transforms, the problems caused by the limited information available in an ELF file will become more prominent. I mean, yes, you can come up with a more complicated implicit contract between the compiler and Propeller about the exact format of Propeller basic blocks, and add additional symbol annotations, and eventually come up with an “IR” that allows Propeller to perform arbitrary code transforms. But that’s inevitably going to be more complicated, and harder to understand, than a compiler IR designed for optimizations.

Thanks for the feedback, I am not sure I fully understand your
concerns but let me try to make some of the things clearer:

* Propeller relinks. Specifically, it regenerates ELF object files
from MIR. Even if MIR were serializable, we would still be starting
before CFI instruction inserter pass and then regenerate the native
ELF objects.
* All the code transformations we are planning to do for futuristic
optimizations is on the MIR just like you noted in a previous email.
For example, prefetch insertion was one optimization we were looking
at where the input is inserting prefetches at specific points in the
binary which we will translate to basic blocks at MIR and then insert
them there. So, I don't think we will be limited by ELF files.
* In comparison with BOLT, BOLT disassembles to MCInst to apply some
of these optimizations and we get to start from MIR without the cost
of disassembly.
* I think your concern is coming from looking at the linker
relaxations we perform for X86_64 and wondering if this would scale
for RISC, etc. We have not looked at RISC but chatting with Rui
(ruiu@) briefly, this looked like it is doable even without using
thunks.

Thanks
Sri

>
>
>
> If we are going to stick with ELF-as-compiler-IR, I’d like to see more careful documentation of the contract between the compiler and the Propeller linker, so it’s clear what the compiler is/is not allowed to emit at compile-time, and so other compilers could be made compatible in the future. The current Propeller implementation is doing some code rewrites that can’t be justified by just standard ELF section reordering.
>
>
>
> -Eli

Eli Friedman via llvm-dev

unread,

Sep 26, 2019, 8:13:07 PM9/26/19

to Sriraman Tallam, llvm-dev

Now I'm confused. Why are you regenerating the ELF files?

I thought the workflow was something like the following:

1. The compiler (or LTO codegen) generates ELF files with basic block sections.
2. You link once without propeller optimizations.
3. You collect the propeller profile.
4. You use the same ELF files to relink with propeller optimizations.

That's sufficient to implement all the optimizations you described, as far as I can tell.

But it sounds like that's wrong? It sounds like what you're describing is:

1. LTO codegen generates MIR files.
2. You convert the MIR files to ELF object files, and link without propeller optimizations. (Does this step have any propeller-specific code generation differences?)
4. You collect the propeller profile.
5. You apply some propeller optimization to the MIR files.
6. You convert the optimized MIR files to ELF object files, with basic block sections enabled.
7. You link the ELF files, applying propeller reordering optimizations.

(And since you can't serialize MIR, you actually serialize LLVM IR, and run the code generator when you deserialize, to convert that to MIR.)

Is that correct?

If you have the MIR at the time you're making all the propeller optimization decisions, why is the linker rewriting raw x86 assembly, as opposed to performing the relevant transforms on MIR?

Why are you proposing to add a bunch of options to clang to manipulate LLVM code generation, given none of those options are actually used for Propeller workflows?

Sriraman Tallam via llvm-dev

unread,

Sep 26, 2019, 8:31:44 PM9/26/19

to Eli Friedman, llvm-dev

TLDR; We are regenerating ELF files from optimized IR to keep the
cost of generating basic block sections low.

If what you describe above is done, where we generate basic block
sections even before we profile, we must conservatively generate
sections for all functions. This is unnecessarily expensive and we
have shown that generating it on demand based on profiles is much
cheaper. Since the backend generation can be distributed or
parallelized, this is not expensive to do. If we could make the
cost of basic block sections in terms of size bloats cheaper we could
do what you just suggested here, that is, always build with basic
block sections for all basic blocks.

>
> I thought the workflow was something like the following:
>
> 1. The compiler (or LTO codegen) generates ELF files with basic block sections.

The correction here is that we generate first with basic block labels
and not sections.

> 2. You link once without propeller optimizations.

This is correct.

> 3. You collect the propeller profile.

This is correct.

> 4. You use the same ELF files to relink with propeller optimizations.

We use the same optimized IR files to regenerate the ELF files. We
could use optimized MIR here but serializability of MIR is not well
supported yet and is part of the larger plan. We only need to re-run
CFI instruction inserter in MIR. For future optimizations, we can
plug in here and make small code transformations for optimizations
like prefetch insertion. If we attempt to do basic block re-ordering
here, we would be limited to intra-procedural basic block reordering
which is sub-optimal. Hence, we generate basic block sections here
only for the sampled functions which is a small %age.

>
> That's sufficient to implement all the optimizations you described, as far as I can tell.
>
> But it sounds like that's wrong? It sounds like what you're describing is:
>
> 1. LTO codegen generates MIR files.
> 2. You convert the MIR files to ELF object files, and link without propeller optimizations. (Does this step have any propeller-specific code generation differences?)

Yes, like I said above, basic block labels are used here to identify
the basic blocks which are later used in the last step to annotate
profiles.

> 4. You collect the propeller profile.
> 5. You apply some propeller optimization to the MIR files.
> 6. You convert the optimized MIR files to ELF object files, with basic block sections enabled.

Right, technically this is high level IR now as MIR is not
serializable but this is ok for discussion.

> 7. You link the ELF files, applying propeller reordering optimizations.
>
> (And since you can't serialize MIR, you actually serialize LLVM IR, and run the code generator when you deserialize, to convert that to MIR.)
>
> Is that correct?

That is correct.

>
> If you have the MIR at the time you're making all the propeller optimization decisions, why is the linker rewriting raw x86 assembly, as opposed to performing the relevant transforms on MIR?

MIR is still one module at a time. We cannot do inter-procedural
basic block layout here. We can do much more advanced stuff at the
whole-program level in the linker. The relaxation code is the down
side.

For more details, We strongly considered this. We could run something
like a thin link in thin lto figure out the global layout and hand out
the relevant subsets of the global decision to each module. This
looked more complicated and the individual pieces from each module
should still be globally laid out again by the linker. This
constraints us on what we can do for layout and also does not work
well with future optimizations like global alignment like David
pointed out.

>
> Why are you proposing to add a bunch of options to clang to manipulate LLVM code generation, given none of those options are actually used for Propeller workflows?

Where do you suggest labelling and section options should exist? We
looked at basic block sections to be similar to function sections in
terms of option handling?

Thanks
Sri

Eli Friedman via llvm-dev

unread,

Sep 26, 2019, 9:16:49 PM9/26/19

to Sriraman Tallam, llvm-dev

> -----Original Message-----
> From: Sriraman Tallam <tmsr...@google.com>
> Sent: Thursday, September 26, 2019 5:31 PM
> To: Eli Friedman <efri...@quicinc.com>
> Cc: Xinliang David Li <xinli...@gmail.com>; llvm-dev <llvm...@lists.llvm.org>
> Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> Optimizations
>

> For more details, We strongly considered this. We could run something
> like a thin link in thin lto figure out the global layout and hand out
> the relevant subsets of the global decision to each module. This
> looked more complicated and the individual pieces from each module
> should still be globally laid out again by the linker. This
> constraints us on what we can do for layout and also does not work
> well with future optimizations like global alignment like David
> pointed out.

Basically, you need to do branch relaxation etc. in the linker because the layout isn't sufficiently finalized until then? And you *only* need to rewrite branches this way; never anything else, because other instructions don't need relaxation?

On most targets, we do branch relaxation in the compiler, unlike x86 which does it in the assembler. I guess we could teach the compiler to emit some specific "relaxed" pattern, and teach the linker to reverse it. Might be a little tricky to handle certain odd cases well; currently, for example, on AArch64 we sometimes insert a cpsr clobber as part of relaxation. I'm not really happy with having to reimplement that for every target, but if we're sure it's limited to branches, I guess it's okay.

> > Why are you proposing to add a bunch of options to clang to manipulate LLVM
> code generation, given none of those options are actually used for Propeller
> workflows?
>
> Where do you suggest labelling and section options should exist? We
> looked at basic block sections to be similar to function sections in
> terms of option handling?

Generating bitcode with/without propeller doesn't actually affect the generated bitcode, right? So you could say that Propeller is enabled with "-Wl,--enable-propeller", and not change clang at all. I'm not a fan of adding options just because we can. If basic block sections are only used as a sort of secret handshake between the propeller compiler and the propeller linker, we can change that interface later, if we want; if we expose it as a clang option, we're committing to making basic block sections continue to work the same way until the end of time.

Xinliang David Li via llvm-dev

unread,

Sep 26, 2019, 10:03:08 PM9/26/19

to Sriraman Tallam, llvm-dev

MIR level transformations are not excluded and can be done in Propeller framework, just limited to module level optimizations. Anything Global/Interprocedural needs to happen at (real) link time with assistance of module level annotations.

MIR is still one module at a time. We cannot do inter-procedural
basic block layout here. We can do much more advanced stuff at the
whole-program level in the linker. The relaxation code is the down
side.

For more details, We strongly considered this. We could run something
like a thin link in thin lto figure out the global layout and hand out
the relevant subsets of the global decision to each module. This
looked more complicated and the individual pieces from each module
should still be globally laid out again by the linker.

This won't work well -- as cross module inlining has not yet happened. Not only will there be problems with profile annotation, all the profile context sensitives will be lost.

This
constraints us on what we can do for layout and also does not work
well with future optimizations like global alignment like David
pointed out.

>
> Why are you proposing to add a bunch of options to clang to manipulate LLVM code generation, given none of those options are actually used for Propeller workflows?

Propeller workflows include precise profile annotations, so the options are used there.

David

Kristof Beyls via llvm-dev

unread,

Sep 27, 2019, 3:37:31 AM9/27/19

to Sriraman Tallam, llvm-dev

Op vr 27 sep. 2019 om 02:32 schreef Sriraman Tallam via llvm-dev <llvm...@lists.llvm.org>:

>
> If you have the MIR at the time you're making all the propeller optimization decisions, why is the linker rewriting raw x86 assembly, as opposed to performing the relevant transforms on MIR?

MIR is still one module at a time. We cannot do inter-procedural
basic block layout here. We can do much more advanced stuff at the
whole-program level in the linker. The relaxation code is the down
side.

For more details, We strongly considered this. We could run something
like a thin link in thin lto figure out the global layout and hand out
the relevant subsets of the global decision to each module. This
looked more complicated and the individual pieces from each module
should still be globally laid out again by the linker. This
constraints us on what we can do for layout and also does not work
well with future optimizations like global alignment like David
pointed out.

Apologies for the naive question.

Why is MIR restricted to being only module at a time?

If the restriction of only being able to do per-module processing at the MIR level would go away, then all these optimizations could be done in a compile step, and no support would need to be added to a linker?

If the restriction to do cross-module optimization at the MIR level could be removed, would it become a better tradeoff to do this in an LTO compiler step rather than a linker step?

Thanks,

Kristof

Xinliang David Li via llvm-dev

unread,

Sep 27, 2019, 11:53:43 AM9/27/19

to Kristof Beyls, llvm-dev

LTO style optimization uses the monolithic model which Propeller moves away from. The design principle is as much as possible preparation work will be done at module level, while the global step is lean and mean.

For layout/splitting/alignment/packing, linker is the right platform to use. It sees/creates many things compiler does not, such as PLTs/stubs/trampolines. etc. Besides, using linker allows inter-procedural reordering of blocks in native object files as long as those files are compiled with the right annotations. For instance, we can do per application optimal layout for memcpy/memset functions from libc.a depending on the profile data for that app.

David

Thanks,

Kristof

Sriraman Tallam via llvm-dev

unread,

Sep 27, 2019, 12:43:29 PM9/27/19

to Eli Friedman, llvm-dev

On Thu, Sep 26, 2019 at 6:16 PM Eli Friedman <efri...@quicinc.com> wrote:
>
> > -----Original Message-----
> > From: Sriraman Tallam <tmsr...@google.com>
> > Sent: Thursday, September 26, 2019 5:31 PM
> > To: Eli Friedman <efri...@quicinc.com>
> > Cc: Xinliang David Li <xinli...@gmail.com>; llvm-dev <llvm...@lists.llvm.org>
> > Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> > Optimizations
> >
> > For more details, We strongly considered this. We could run something
> > like a thin link in thin lto figure out the global layout and hand out
> > the relevant subsets of the global decision to each module. This
> > looked more complicated and the individual pieces from each module
> > should still be globally laid out again by the linker. This
> > constraints us on what we can do for layout and also does not work
> > well with future optimizations like global alignment like David
> > pointed out.
>
> Basically, you need to do branch relaxation etc. in the linker because the layout isn't sufficiently finalized until then?

Yes, this is correct. Relaxation is needed because the layout is not
finalized until link time.

And you *only* need to rewrite branches this way; never anything else,
because other instructions don't need relaxation?

I cannot think of any other instance where such relaxation is needed.

>
> On most targets, we do branch relaxation in the compiler, unlike x86 which does it in the assembler. I guess we could teach the compiler to emit some specific "relaxed" pattern, and teach the linker to reverse it. Might be a little tricky to handle certain odd cases well; currently, for example, on AArch64 we sometimes insert a cpsr clobber as part of relaxation. I'm not really happy with having to reimplement that for every target, but if we're sure it's limited to branches, I guess it's okay.

Understood.

>
> > > Why are you proposing to add a bunch of options to clang to manipulate LLVM
> > code generation, given none of those options are actually used for Propeller
> > workflows?
> >
> > Where do you suggest labelling and section options should exist? We
> > looked at basic block sections to be similar to function sections in
> > terms of option handling?
>
> Generating bitcode with/without propeller doesn't actually affect the generated bitcode, right? So you could say that Propeller is enabled with "-Wl,--enable-propeller", and not change clang at all. I'm not a fan of adding options just because we can. If basic block sections are only used as a sort of secret handshake between the propeller compiler and the propeller linker, we can change that interface later, if we want; if we expose it as a clang option, we're committing to making basic block sections continue to work the same way until the end of time.

The generated MIR code is slightly different as the later passes have
more CFI instructions, basic block labels and extra direct jumps which
would have been fall throughs otherwise. So, some llvm option is
needed.

I envisioned that basic block sections could also be useful on its own
outside of propeller. Hence, we made it like function-sections with
a separate option -fbasicblock-sections. The propeller option is an
umbrella option to make it easy to invoke Propeller.

Thanks
Sri

Eli Friedman via llvm-dev

unread,

Sep 27, 2019, 4:16:45 PM9/27/19

to Sriraman Tallam, llvm-dev

> -----Original Message-----
> From: Sriraman Tallam <tmsr...@google.com>
> Sent: Friday, September 27, 2019 9:43 AM
> To: Eli Friedman <efri...@quicinc.com>
> Cc: Xinliang David Li <xinli...@gmail.com>; llvm-dev <llvm...@lists.llvm.org>
> Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
> Optimizations
>
> >

> > > > Why are you proposing to add a bunch of options to clang to manipulate
> LLVM
> > > code generation, given none of those options are actually used for Propeller
> > > workflows?
> > >
> > > Where do you suggest labelling and section options should exist? We
> > > looked at basic block sections to be similar to function sections in
> > > terms of option handling?
> >
> > Generating bitcode with/without propeller doesn't actually affect the
> generated bitcode, right? So you could say that Propeller is enabled with "-Wl,--
> enable-propeller", and not change clang at all. I'm not a fan of adding options
> just because we can. If basic block sections are only used as a sort of secret
> handshake between the propeller compiler and the propeller linker, we can
> change that interface later, if we want; if we expose it as a clang option, we're
> committing to making basic block sections continue to work the same way until
> the end of time.
>
> The generated MIR code is slightly different as the later passes have
> more CFI instructions, basic block labels and extra direct jumps which
> would have been fall throughs otherwise. So, some llvm option is
> needed.

At link (LTO codegen) time, yes. But clang's bitcode generation doesn't change; only LTO code generation in lld changes.

> I envisioned that basic block sections could also be useful on its own
> outside of propeller. Hence, we made it like function-sections with
> a separate option -fbasicblock-sections. The propeller option is an
> umbrella option to make it easy to invoke Propeller.

I don't think this is really true. Basic block labels means "insert labels that are useful for propeller profiling", and basic block sections means "split the function into multiple chunks that are convenient for propeller optimization". So it's really only useful for propeller-like workflows. And I'd be concerned that future changes to propeller could affect the behavior of these options in the future, if the behavior isn't specified in some rigorous way.

In any case, if we're adding flags to clang other than "enable propeller", I think we should have a separate thread on cfe-dev to discuss them. We don't expose every possible LLVM optimization and code generation option through clang. Users have a strong expectation of stability for clang options, and that means we need to evaluate each new option carefully, to decide whether it's something we really want to support indefinitely.

Sriraman Tallam via llvm-dev

unread,

Sep 27, 2019, 5:08:06 PM9/27/19

to Eli Friedman, llvm-dev

I see what you mean here.

>
> > I envisioned that basic block sections could also be useful on its own
> > outside of propeller. Hence, we made it like function-sections with
> > a separate option -fbasicblock-sections. The propeller option is an
> > umbrella option to make it easy to invoke Propeller.
>
> I don't think this is really true. Basic block labels means "insert labels that are useful for propeller profiling", and basic block sections means "split the function into multiple chunks that are convenient for propeller optimization". So it's really only useful for propeller-like workflows. And I'd be concerned that future changes to propeller could affect the behavior of these options in the future, if the behavior isn't specified in some rigorous way.

The idea of basic block sections was seeded by Rui Ueyama. When basic
block sections was originally looked at, Propeller was not designed.
We looked at basic block sections as finer granularity than function
sections. Section prefix based code layout where you just create
smaller code sections and let the linker do what it does today would
have much better results with basic block sections. Symbol ordering
file with basic block sections to do random orderings can be done
without invoking Propeller. Function sections has found uses after it
was invented like Identical Code Folding.

>
> In any case, if we're adding flags to clang other than "enable propeller", I think we should have a separate thread on cfe-dev to discuss them. We don't expose every possible LLVM optimization and code generation option through clang. Users have a strong expectation of stability for clang options, and that means we need to evaluate each new option carefully, to decide whether it's something we really want to support indefinitely.

Understood.

Thanks
Sri

Eric Christopher via llvm-dev

unread,

Sep 28, 2019, 1:36:55 AM9/28/19

to Sriraman Tallam, llvm-dev

I'm not sure that function sections are entirely an apt comparison
here for a few reasons:

- function sections mostly gave an easy way to do ICF - using MIR or
LLVM IR are also fairly effective means as is just using the linker to
compare bytes. Using function sections can still run into language
specific problems as well, hence why icf=safe vs icf=all. Though pcc's
recent work helps out a little bit via significant address
determination in the front end
- order files and other similar mechanisms work based on externally
visible symbols and are fairly straightforward to use in an ABI safe
way (no function -> function fallthrough for example)
- basic block sections can run into the multiple entry point problems
depending on how they're laid out as well as other ABI and visibility
issues

This can likely be done safely, but there are definitely some concerns
here with how you want to enable the use of basic block
labels/sections.

-eric

Sriraman Tallam via llvm-dev

unread,

Sep 28, 2019, 11:12:09 AM9/28/19

to Eli Friedman, llvm-dev

After having spent more time on this, I think what you are suggesting is to impose this work-flow for Propeller always:

Step 1: generate labels binary

clang -emit-llvm -c -O2 foo.cc bar.cc (generates optimized foo.bc and bar.bc)

./lld foo.bc bar.bc --basicblock-labels -o a.out.labels (generates labels binary)

Step 2: Profile into perf.data just like now

Step 3: Convert profiles just like now

Step 4: Re-optimize with propeller by invoking lld directly

./lld foo.bc bar.bc --basicblock-sections --profile=perf.propeller -o a.out

If this workflow is adopted we might be able to get away without clang options but I still think Step 1 could use a clang option to combine the 2 steps into one and may be one more option in Step 4 might be useful if we want the invocation to be via clang.

Notice how this work-flow is very similar to ThinLTO's, and even thinlto needs command-line options, -flto=thin, to break this down into the steps.

However, the issue with this work-flow imposes LTO like workflow on Propeller. Propeller is LTO agnostic and works even without LTO or regular FDO. The above does not support the simple workflows where you build each module with -O3 and then dump the native object files and link. I think it would be beneficial to allow Propeller to be used everywhere by modifying the make files to adopt it rather than constrain the work flow.

Thanks

Sri

Sriraman Tallam via llvm-dev

unread,

Sep 28, 2019, 11:25:45 AM9/28/19

to Eric Christopher, llvm-dev

I did not mean to suggest we should do folding with basic block sections and I am aware of the issues, though I am not ruling out the possibility. We had briefly brainstormed this.

AFAICT, Function Sections was not invented to do ICF. Function sections was used for linker garbage collection and global code layout and when we first worked on ICF for gold much later, we found function sections to be useful without which it would have been much harder.

compare bytes. Using function sections can still run into language
specific problems as well, hence why icf=safe vs icf=all. Though pcc's

Agreed, and I would like to humbly suggest that I added those options to gold originally and am aware of that.

recent work helps out a little bit via significant address
determination in the front end
- order files and other similar mechanisms work based on externally
visible symbols and are fairly straightforward to use in an ABI safe
way (no function -> function fallthrough for example)
- basic block sections can run into the multiple entry point problems
depending on how they're laid out as well as other ABI and visibility
issues

This can likely be done safely, but there are definitely some concerns
here with how you want to enable the use of basic block
labels/sections.

I would like to expose basic block sections independently and let me talk more about one more effort we are trying to do. Independent of Propeller, we are looking at a machine learning based approach to do code layout at link time. I won't go into the details of the features but having basic block sections allows us to naturally feed the native object files into a black box that would try several different reorderings of the sections and learn a model. We looked at basic block sections as the building block on top of which Propeller was built. I don't think it would be the only application, just my opinion.

Thanks

Sri

Sriraman Tallam via llvm-dev

unread,

Sep 28, 2019, 11:45:53 AM9/28/19

to Eric Christopher, llvm-dev

I missed this point. Just to be clear, With the linker relaxation, a simple order file of basic block symbols can be used to re-order basic blocks in any manner. This can be done with the existing patches *now* and we have used this extensively in our work to validate our layout decisions.

Eric Christopher via llvm-dev

unread,

Sep 30, 2019, 3:26:28 PM9/30/19

to Sriraman Tallam, llvm-dev

They're both in the same arena.

>
>>
>> compare bytes. Using function sections can still run into language
>> specific problems as well, hence why icf=safe vs icf=all. Though pcc's
>
>
> Agreed, and I would like to humbly suggest that I added those options to gold originally and am aware of that.

Of course, but context is helpful for discussion and for others in the
future understanding the discussion.

>
>>
>> recent work helps out a little bit via significant address
>> determination in the front end
>> - order files and other similar mechanisms work based on externally
>> visible symbols and are fairly straightforward to use in an ABI safe
>> way (no function -> function fallthrough for example)
>> - basic block sections can run into the multiple entry point problems
>> depending on how they're laid out as well as other ABI and visibility
>> issues
>>
>>
>> This can likely be done safely, but there are definitely some concerns
>> here with how you want to enable the use of basic block
>> labels/sections.
>
>
> I would like to expose basic block sections independently and let me talk more about one more effort we are trying to do. Independent of Propeller, we are looking at a machine learning based approach to do code layout at link time. I won't go into the details of the features but having basic block sections allows us to naturally feed the native object files into a black box that would try several different reorderings of the sections and learn a model. We looked at basic block sections as the building block on top of which Propeller was built. I don't think it would be the only application, just my opinion.
>

Again a question of "why at the object level rather than a low level
IR" seems like something we should really explore the tradeoffs for at
more length.

-eric

Sriraman Tallam via llvm-dev

unread,

Sep 30, 2019, 3:32:16 PM9/30/19

to Eric Christopher, llvm-dev

This is so that we can do inter-procedural basic block ordering. Doing this at link time with all the basic blocks exposes a lot more inter-procedural ordering opportunities. At a low-level IR, we are still limited to intra-module basic block reordering.

Thanks

Sri

Eric Christopher via llvm-dev

unread,

Sep 30, 2019, 4:27:14 PM9/30/19

to Sriraman Tallam, llvm-dev

Seems like you could spend just as much work and reliability on making
MIR/late stage LTO compilation handle this for you?

-eric

Sriraman Tallam via llvm-dev

unread,

Sep 30, 2019, 4:34:38 PM9/30/19

to Eric Christopher, llvm-dev, Xinliang David Li

Full LTO yes, Thin LTO ... no. We want this to be scalable and ThinLTO is what we are more interested in.

Teresa Johnson via llvm-dev

unread,

Sep 30, 2019, 4:39:59 PM9/30/19

to Sriraman Tallam, llvm-dev, Xinliang David Li

As Sri said, we need to be able to use this for ThinLTO, and I'm not sure how you would do whole program BB layout in ThinLTO which does the whole program analysis much much earlier than MIR. Eric - did you have something else in mind or were you talking about Full LTO?

Teresa

Thanks
Sri

-eric

> Thanks
> Sri
>
>
>>
>>
>> -eric
>>
>> > Thanks
>> > Sri
>> >
>> >
>> >
>> >
>> >>
>> >>
>> >> -eric
>> >>
>> >>
>> >> > >
>> >> > > In any case, if we're adding flags to clang other than "enable propeller", I think we should have a separate thread on cfe-dev to discuss them. We don't expose every possible LLVM optimization and code generation option through clang. Users have a strong expectation of stability for clang options, and that means we need to evaluate each new option carefully, to decide whether it's something we really want to support indefinitely.
>> >> >
>> >> > Understood.
>> >> >
>> >> > Thanks
>> >> > Sri
>> >> >
>> >> > >
>> >> > > -Eli
>> >> > _______________________________________________
>> >> > LLVM Developers mailing list
>> >> > llvm...@lists.llvm.org
>> >> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--

Teresa Johnson |

Software Engineer |

tejo...@google.com |

Xinliang David Li via llvm-dev

unread,

Sep 30, 2019, 4:50:08 PM9/30/19

to Teresa Johnson, llvm-dev

I guess Eric means full program optimization/cross module optimization using MIR. This is in theory workable in full LTO style, but not in ThinLTO style which works on summary data. As we have discussed, eliminating monolithic style optimization is the key design goal.

This was also briefly discussed in one of the previous replies I sent. There are other benefits of doing this in linker such as handling prebuilt libraries such as libc.

David

James Y Knight via llvm-dev

unread,

Sep 30, 2019, 6:08:01 PM9/30/19

to Xinliang David Li, llvm-dev

One could imagine using MIR as an "object-file" format from the compiler to the linker -- and then within the linker, doing block layout on the MIR and lowering to actual instructions.

I don't know that I'd say that's simpler than what you've described, however.

Teresa Johnson via llvm-dev

unread,

Sep 30, 2019, 6:10:50 PM9/30/19

to James Y Knight, llvm-dev, Xinliang David Li

On Mon, Sep 30, 2019 at 3:07 PM James Y Knight <jykn...@google.com> wrote:

One could imagine using MIR as an "object-file" format from the compiler to the linker -- and then within the linker, doing block layout on the MIR and lowering to actual instructions.

Sure, but this moves some of the currently parallel compilation (in ThinLTO at least) into a monolithic phase (the linker), which is what Propeller is trying to avoid to achieve scalability. Teresa

Maksim Panchenko via llvm-dev

unread,

Oct 2, 2019, 2:41:22 PM10/2/19

to Sriraman Tallam, llvm...@lists.llvm.org

Hi Sri and the team,

Thank you for sharing your proposal. It helps bring awareness to the importance
of context-sensitive and link-time optimizations in the compiler toolchain for
boosting the performance of large applications. The proposal also shows that
different approaches are possible to achieve a similar goal.

Putting basic blocks into separate sections is an ambitious approach with
expected challenges. For the first time, the real overheads of this approach
were measured at scale, and the results look quite staggering. I can imagine
how it might work for Google where the overhead of the 2-stage re-compilation
could be masked out by a distributed build system, and dealing with C++
exceptions is not an issue due to software development practices. At the same
time, it should be clear that for users without access to a distributed build
system, the actual build-time overhead will far exceed that of BOLT.
Considering the proposal in its current form, I'm not convinced it's the best
approach even for Google in the long run.

Since the proposal references BOLT in multiple places, and since I’m directly
involved with BOLT and hence potentially biased, I decided to break my feedback
into two parts, with the first part being unrelated to BOLT and hopefully being
as objective as possible.

Here’s a quick summary of my concerns, followed by more detailed feedback.

PERFORMANCE OF TARGET APPLICATIONS
* Poor optimization of C++ code bases with exceptions
* Pessimization/overhead for stack unwinding used by system-wide profilers and
for exception handling
* Increased memory footprint at application runtime
* No clear plan for adding optimizations in the future without the need for
more RFCs
* Effectiveness of the relaxation pass
* Inconsistencies in performance data
* Debugging overhead

BUILD SYSTEM
* Requirement for two different builds and hidden overhead of Propeller
* Step 1 overheads
* Enormous increase in object file size
* Lack of scalability for actual code reordering during the optimization phase

PROFILING
* Propeller-optimized binary for continuous deployment

PERFORMANCE OF TARGET APPLICATIONS

*Poor optimization for C++ code bases with exceptions*

From what I understand, your list of large real-world applications (you exclude
SPEC from the list, which I totally agree with) does not include a single one
that uses C++ exceptions. This is based on the assumption that exceptions are
not used at Google, and Clang is compiled without the exceptions support by
default. I consider this is a major omission from the evaluation and a drawback
of the proposal.

A lot of exception-handling code is generated by compiler “behind the scenes”
and is invisible to the user, but is exposed to Propeller. Even if such code is
never executed, i.e. exceptions are not thrown, it is important to be able to
reorder the code in the function including these blocks. The RFC says: “basic
blocks that are involved in exception handling, that is, they either throw or
catch an exception, are grouped together and placed in a single section. Unique
sections are not created for such basic blocks.” Further down the paragraph,
you say: “benchmarks which spend a significant amount of time handling
exceptions might not get the optimization benefits of Propeller.“ The proposal,
intentionally or not, makes it look like applications that are compiled with
exceptions support and that only use them rarely, i.e. for error-handling, are
not affected. Based on the limitations you describe above, I believe the code
reordering and thus optimization benefits will be limited for such
applications.

If you cannot find any large real-world benchmark that uses C++ exceptions, at
the very minimum, I would like to see Clang itself being compiled with
exceptions support and optimized by Propeller. Granted, it will not exercise
any exception-handling code, but at least the results will give an idea of how
well Propeller handles optimizing code with the mechanism enabled.

*Pessimization/overhead for stack unwinding used by system-wide profilers and
for exception handling*

Larger CFI programs put an extra burden on unwinding at runtime as more CFI
(and thus native) instructions have to be executed. This will cause more
overhead for any profiler that records stack traces, and, as you correctly note
in the proposal, for any program that heavily uses exceptions.

*Increased application memory footprint*

The increase of .eh_frame section is up to 17x, which is quite significant.
Since it’s part of the runtime image of the application, it is important to
know how much larger the application memory footprint gets. I.e., in addition
to the “Total” size of the binary in the comparison table, I would like to see
“Total Allocatable” size data.

*No clear plan for adding optimizations in the future without the need for
additional RFCs*

Some optimizations other than basic block reordering are best executed at link
time. One example would be an optimization for macro-op fusion on x86-64 CPUs.
We’ve seen a real-world application, bzip2, regressed by 5% in the case of
“unlucky” code placement. The optimization requires an analysis of instructions
(in this case, instruction pairs) and modification of the instruction stream
such as insertion of nops. For the optimization to be performed in the
compiler, it would require functions to be aligned at 64 bytes which is
generally not practical, hence an opportunity for link-time/binary
optimization. What is the plan for Propeller to handle the above and similar
cases?

*Effectiveness of the relaxation pass*

When you describe the relaxation pass, you do not mention if you run it till it
converges, or you run it conservatively leaving larger than necessary versions
of jump instructions.

*Inconsistencies in performance data*

Performance data in bar charts indicate that Propeller is not achieving all
possible improvements for Clang, the only open-source benchmark in the
evaluation where code-layout optimizations matter. Cycles, I-Cache, and
branches-not-taken indicate that BOLT is doing better. At the same time, the
summary table shows 9% improvement for all. Which data set is correct?

*Debugging overhead*

As Propeller has to support debug info generation on a per-basic-block basis,
it increases DWARF sections corresponding to address ranges, location lists,
and line number info. How much larger are these sections with Propeller flags?
We are currently pushing the limits of 32-bit DWARF, and I wonder if you
encounter any related issues considering the Propeller bloat?

With larger debug info, the overhead flows to debuggers and tools using debug
information for annotation purposes, e.g. profilers reporting line numbers
corresponding to binary addresses. Depending on their usage of the debug info,
these tools will likely use more memory and spend more time parsing DWARF. Have
you considered/measured this overhead?

By the way, I didn’t see DWARF location lists specifically mentioned in the
proposal. Making sure you use extra entries for those too.

BUILD SYSTEM

*Requirement for two different builds and hidden overhead of Propeller*

While discussing Memory Overhead and Runtime Overhead in the RFC, you chose to
include the overhead incurred only during link time. However, compared to a
regular highly-optimized AutoFDO+(Thin)LTO build, you require a second full
build of the application with (Thin)LTO. Even with the caching of IR, there’s
an extra cost of generating code not reflected in your evaluation. With tight
integration with the build system, it is possible to hide these costs
partially. In either case, since “all the benchmarks were built for the X86_64
platform on a 18 core Intel machine with 192 G of RAM“, you should be able to
measure extra time for compilation spent on this machine.

*Step 1 overheads*

Furthermore, wouldn’t the LLVM compiler incur extra overhead when compiling
with Propeller flags because of larger CFIs, debug info, etc.? I think it would
be fair to measure and report the resulting compile-time and memory overhead
for both builds (steps 1 and 4) versus regular build.

Additionally, what are runtime and memory overheads for linking in Step 1? Do
they correspond to the “All” column in the Experiments section? If you chose to
compare just the link-time overhead versus the baseline, it should come from
the combination of link times of steps 1 and 4 since you have to link twice for
Propeller.

*Enormous increase in object file size*

Even with the optimized on-demand approach, the size of object files almost
doubles. Since in a distributed build system object files are cached and
shared, I’m not sure it is fair to describe this approach as being “friendly to
distributed build systems” since it increases the existing load.

What is the breakdown of the increase in object file sizes? As you mention in
the proposal, DWARF natively supports address ranges. However, specifying a
separate range for every basic block will introduce an overhead. Is this
increase included in the total overhead for the “Object File Sizes Bloat”?

*Lack of scalability for actual code reordering during the optimization phase*

When you talk about Propeller being a distributed and scalable system, you
mention that it has “a distributed part and a whole-program part”. Does this
mean that the part where you need to recompile the application is distributed
(if you are using a distributed build system), but the actual link and the code
reordering optimization are happening serially?

PROFILING

Real-world systems built to benefit from profile feedback optimizations, such
as AutoFDO, are designed to benefit from a feedback collected on a previous
revision of the highly optimized binary running in prod. Does Propeller support
profiling binaries optimized by Propeller with the purpose of future
optimizations?

BOLT

Now to the BOLT part.

As you know, BOLT was originally designed to handle applications built using
arbitrary toolchain without any modification to the toolchain itself or the
build process (other than “--emit-relocs” linker flag for maximum gains).
Integration with any particular toolchain, such as LLVM+lld, obviously brings
immediate advantages to the design and thus, at least in theory, comparing BOLT
to such a system would not be apples-to-apples. However, I think BOLT stands
fairly competitive even in its current form. With a few enhancements and a
little help from the linker, issues raised in the Propeller proposal could be
addressed without the need for the radical section-per-basic-block approach.

Recently (July 2019), we have parallelized many passes in BOLT and improved its
runtime. If you run experiments using an old version, you should try a more
recent one. For example, optimizing Clang 7.0, with “.text” of ~50MB, takes
less than one minute. Hardly a use case requiring any further optimization and
justifying a need for a distributed system.

One of your main arguments is that Propeller has 20% memory footprint of that
of BOLT. This is achieved by limiting optimization to profiled functions, i.e.
typically less than 10% of all functions. BOLT’s main memory overhead comes
from the intermediate representation of all functions in the binary in the form
of MCPlus. If we decide to keep IR only for functions with profile information,
i.e. less than 10% of all functions, there’s fundamentally no reason why BOLT’s
memory usage wouldn’t get decimated and brought on-par with Propeller’s link
time, perhaps even less. We never found that to be a restriction in our
environment and haven’t heard it been an issue other than from you.

The following statement about BOLT in your proposal does not sound right: “with
BOLT, even small source changes invalidate the profile information, an error if
the binary’s build id does not match the profile [sic]. Hence, the binary
corresponding to the source change must be re-built and re-profiled.“ Quite the
opposite, BOLT is built to tolerate binary differences, including those
resulting from LTO, and even supports profiling previously BOLTed binaries. The
only place we check for the build ID is during the profile conversion phase
corresponding to Step 3 in your proposal. This is done to avoid any possible
user error, as the collection (Step 2) and conversion often happen on different
machines. It might be a good idea to implement this check in Propeller too.

Comparison on “Search1” includes data gathered with huge pages enabled for
Propeller and disabled for BOLT. For a direct comparison, you can include a
line with and without huge pages. If you cannot enable huge pages for BOLT, it
is fair to exclude the result.

I should also note that BOLT performs a set of optimizations beyond the code
layout that can benefit from context-sensitive profile information, such as
indirect call promotion, inlining, shrink-wrapping etc. For the full set of
optimizations implemented in BOLT, please check our CGO’19 paper
(https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

BOLT’s additional benefits include a support for code instrumentation and code
protection/hardening. E.g. spectre/meltdown mitigation including that for
assembly-written code.

Obviously, this is not the right place for an alternative proposal, but I
strongly believe that integrating BOLT with a linker, such as lld, would
address the majority of the issues and provide a seamless solution to the users
of the LLVM toolchain without the need for radical changes to the emitted code
and LLVM codegen. If the scalability is indeed a concern, which so far is not
what we’ve seen, then we might consider changing LLVM itself to emit more data
or add integration to a distributed build system. But again, IMHO this would be
a solution to a nonproblem.

https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1809.04676&d=DwIFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=4c9jZ8ZwYXlxUZHyw4Wing&m=_GpIFlYp6xXw55TW32G_Z4B6vhpncSis7Vdf3sutEf8&s=2jV6DGC8za8IdgV-438b6TzUOsRienUX5hcyLbsHYag&e=

+ Added support in the linker to re-order basic block sections in any
specified sequence (can be done with symbol ordering file). This requires
linker relaxations to delete and shrink branches across basic blocks.

+ Compared our system to BOLT and have shown that our system can produce
similar performance improvements as BOLT but with much less memory and time
overheads. Our experiments are on very large warehouse-scale benchmarks and
SPEC 2017.

+ Listed why this cannot be done as part of PGO itself. Post Link
optimizations are able to transform the binary using accurate profiles and PGO
passes suffer from profile imprecision.

+ Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks
via basic block sections.

+ Discussed CFI handling and is sub-optimal and bloats object file sizes and
binary sizes much more than DebugInfo due to lack of support for discontiguous
address ranges. We have talked about this and would like to make a case to
support discontiguous ranges with CFI which will make basic block sections much
more cheaper.

Detailed RFC document here :
https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf

Please let us know what you think,
Thanks
Sri on behalf of the team.

Krzysztof Pszeniczny via llvm-dev

unread,

Oct 2, 2019, 3:20:09 PM10/2/19

to Maksim Panchenko, llvm...@lists.llvm.org

On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev <llvm...@lists.llvm.org> wrote:

*Pessimization/overhead for stack unwinding used by system-wide profilers and
for exception handling*

Larger CFI programs put an extra burden on unwinding at runtime as more CFI
(and thus native) instructions have to be executed. This will cause more
overhead for any profiler that records stack traces, and, as you correctly note
in the proposal, for any program that heavily uses exceptions.

The number of CFI instructions that have to be executed when unwinding any given stack stays the same. The CFI instructions for a function have to be duplicated in every basic block section, but when performing unwinding only one such a set is executed -- the copy for the current basic block. However, this copy contains precisely the same CFI instructions as the ones that would have to be executed if there were no basic block sections.

--

Krzysztof Pszeniczny

Maksim Panchenko via llvm-dev

unread,

Oct 2, 2019, 4:25:17 PM10/2/19

to Krzysztof Pszeniczny, llvm...@lists.llvm.org

Thanks for clarifying. This means once you move to the next basic block (or any other basic

block in the function) you have to execute an entirely new set of CFI instructions

except for the common CIE part. While indeed this is not as bad, on average, the overall

active memory footprint will increase.

Creating one FDE per basic block means that .eh_frame_hdr, an allocatable section,

will be bloated too. This will increase the FDE lookup time. I don’t see .eh_frame_hdr

being mentioned in the proposal.

Maksim

James Y Knight via llvm-dev

unread,

Oct 2, 2019, 4:59:35 PM10/2/19

to Maksim Panchenko, llvm...@lists.llvm.org

I'm a bit confused by this subthread -- doesn't BOLT have the exact same CFI bloat issue? From my cursory reading of the propellor doc, the CFI duplication is _necessary_ to represent discontiguous functions, not anything particular to the way Propellor happens to generate those discontiguous functions.

And emitting discontiguous functions is a fundamental goal of this, right?

_______________________________________________

Rafael Auler via llvm-dev

unread,

Oct 2, 2019, 6:18:45 PM10/2/19

to James Y Knight, Maksim Panchenko, llvm...@lists.llvm.org

You’re correct, except that, in Propeller, CFI duplication happens for every basic block as it operates with the conservative assumption that a block can be put anywhere by the linker. That’s a significant bloat that is not cleaned up later. So, during link time, if N blocks from the same function are contiguous in the final layout, as it should happen most of the time for any sane BB order, we would have several FDEs for a region that only needs one. The bloat goes to the final binary (a lot more FDEs, specifically, one FDE per basic block).

BOLT will only split a function in two parts, and only if it has profile. Most of the time, a function is not split. It also has an option not to split at all. For internally reordered basic blocks of a given function, it has CFI deduplication logic (it will interpret and build the CFI states for each block and rewrite the CFIs in a way that uses the minimum number of instructions to encode the states for each block).

Sriraman Tallam via llvm-dev

unread,

Oct 2, 2019, 10:24:55 PM10/2/19

to James Y Knight, llvm...@lists.llvm.org, David Blaikie

Maks and team, thanks for the detailed feedback and we will address all of your
concerns. Let’s begin with CFI and DebugInfo first since this is already
being discussed.

TLDR; clang is pathological and the actual CFI bloat will go down from 7x to
2x.

Let us present the CFI Bloats for each benchmark with the default option, which
is creating basic block sections only for functions with samples. For clang,
it is 7x and *not 17x* (the 17 x number is for all sections), and for the
large benchmarks it is less than 30% or 1.3x. For large benchmarks, Storage is
the highest, going from 18M to 23M, ~30%. Clang is almost pathological here.
This is because 10% of all functions in clang have samples (touched by the
profile information), and all of these functions get full basic block sections.
Whereas, for the large benchmarks, only 0.5% of functions have samples. Now,
for clang, 10% of functions that have samples also happen to contain 35% of all
basic blocks. This means, we are creating sections for 35% of all basic blocks
and the CFI bloats are clearly showing.

Now, the data for clang also shows that only 7% of the basic blocks have
samples. We are working on restricting basic block sections to only those basic
blocks that have samples. The rest of the basic blocks (cold) in that function
would share the same section. With this, we would reduce the bloat of CFI
from 7x to 2x. This is not hard to do and we will follow up with this patch.

Also for object size bloats with regards to eh_frame, the reasoning is similar.
Restricting the section creation to only basic blocks that have profiles will
reduce this a lot more.

Importantly, if CFI support were available for discontiguous ranges we wouldn't
have to duplicate CFI FDEs and the bloats would be near minimal.

BOLT parses CFI and DWARF and generates compact information by rewriting it.
Whereas, Propeller uses lld which uses relocations and sections to fixup but
does not rewrite it. This is by design and that lld is not DWARF and CFI
aware. We designed basic block sections just like function sections. The
compiler produces a bunch of sections and relocations. The linker patches the
relocations and the debug info and CFI are right, that's it. For CFI, since
there is no support for discontiguous ranges we have to duplicate and dedup
FDEs only for blocks with sections. We are asking that CFI support
discontiguous ranges and this would look even simpler. Alternately, if lld
were made DWARF and CFI aware we could rewrite it compactly like BOLT.
These would help with object size bloats and binary size bloats.

Sriraman Tallam via llvm-dev

unread,

Oct 7, 2019, 2:16:18 PM10/7/19

to James Y Knight, llvm...@lists.llvm.org, David Blaikie

We would also like to clarify on the misconceptions around CFI Instructions:

There are two things that need to be clarified here:

1) Extra CFI FDE entries for basic blocks does not mean more dynamic
instructions are executed. In fact, they do not increase at all. Krys
talked about this earlier.
2) We do deduplication of common static CFI instructions in the FDE
and move it to the CIE . Hence, moving to a new basic block does not
mean a completely new set of CFI instructions is executed.

Sriraman Tallam via llvm-dev

unread,

Oct 7, 2019, 2:18:24 PM10/7/19

to James Y Knight, llvm...@lists.llvm.org, David Blaikie

The next concern we want to address is about C++ Exceptions which was
considered a major omission:

TLDR; This can be implemented in propeller easily. It was considered
low priority because exception handling code is assumed in general not
performance critical .

Currently, we do not create basic block sections for any block that is
involved in exception handling. There is nothing fundamentally
difficult about exception handling basic blocks. The exception table
uses address offsets which currently does not use relocations, and we
must modify it to use relocations in order to enable sections for
exception handling basic blocks.

We will re-prioritize and send out a patch to handle exception basic blocks.

Thanks
Sri

Sriraman Tallam via llvm-dev

unread,

Oct 8, 2019, 4:19:46 PM10/8/19

to James Y Knight, llvm...@lists.llvm.org, David Blaikie

Some more information about the relaxation pass whose effectiveness
and convergence guarantees were listed as a concern:

TLDR; Our relaxation pass is similar to what LLVM’s MCAssembler does
but with a caveat for efficiency. Our experimental results show it is
efficient and convergence is guaranteed.

Our relaxation pass is very similar to what MCAssembler does as it
needs to solve the same problem, which is laying out basic blocks and
optimizing branch instruction sizes based on jump offset. The optimal
algorithm to do this is quadratic and it involves computating of
section offsets multiple times. Our relaxation pass has a caveat for
efficiency. Our relaxation pass recomputes section offsets only after
all the sections have been processed. This is repeated until
convergence, whereas, the MCAssembler re-computes section offsets
every time a section is modified. Our pass works well in the linker as
it minimizes the section offset recomputation, which is more expensive
to do in the linker (whole program) than doing it in the assembler
(per function).

We shrink jump instructions aggressively and then fix the wrongly
shrunk branches by running a pass that grows them back. We guarantee
convergence because we always shrink in the first round and always
grow in the second.

Data shows that our relaxation is efficient. For example, for clang,
the aggressive shrinking affects ~995000 branches in 6 iterations and
only 4 branches had to be grown back.

Sriraman Tallam via llvm-dev

unread,

Oct 9, 2019, 5:11:36 PM10/9/19

to James Y Knight, llvm...@lists.llvm.org, David Blaikie

Next, we would like to address questions on performance data regarding
clang improvements with BOLT and Propeller:

TLDR; We have provided a Makefile in the git repo, under
llvm-propeller/splo, to compare BOLT and Propeller performance. “make
check-performance” with pointers to the BOLT binaries should do the
comparisons.

In terms of performance optimization for code layout, fundamentally,
BOLT and Propeller are similar. In fact, in one experiment, we
extracted the basic block order from BOLT and just used basic block
sections and linker symbol ordering to perform the same order during
linking. The performance was identical and the ordering was exactly
as expected.

Some recent changes in Propeller has caused it to regress by a small
amount and is sometimes getting hidden by the small noise. We are
debugging the performance issue and it looks like an alignment issue
or a second order effect. Currently, because of this bug, BOLT is
faster than Propeller by about 0.5% to 1% on Clang benchmark.

Sriraman Tallam via llvm-dev

unread,

Oct 10, 2019, 4:35:30 PM10/10/19

to James Y Knight, llvm...@lists.llvm.org, David Blaikie

We have addressed concerns about Debug Info and Overhead below and
talked about Debug Fission which is very important when talking about
scalability of builds.

Also, we have been addressing each topic piecemeal to keep discussions
focussed. I will provide a consolidated document putting together our
clarifications on all the concerns raised.

Large binaries are built using debug fission to as it significantly
improves the scalability of debug builds,
https://gcc.gnu.org/wiki/DebugFission. Quoting from the page which
again reiterates the importance of scalability and incremental builds,
Fission was invented to mitigate the following issues for debug builds
of large applications:

* Out of memory errors
* Slow link times
* Slow gdb startup times

Debug Fission essentially splits the dwarf info into a separate object
file and the link action memory overheads are smaller as the input
object files and the resulting binary are smaller.

Propeller supports both fission and non-fission builds. Even with
fission, the dwarf objects are not relocatable, so the extra
relocations needed by Propeller would still go into the main object
file that contains code. Also, Propeller uses extra relocations for
location lists too.

We looked at BOLT and it does not seem to have any support for
binaries built with fission, something we missed mentioning in the
original RFC. Fission supports a mode which allows the individual
dwarf objects, dwo files, to be linked into a single dwp file and
BOLT would have to also rewrite the fission object file and likely in
the same process.

Another concern raised was regarding debugging overhead. The specific
concern here is that having more debug ranges with basic block
sections might increase overheads of debugging tools. We have not
evaluated this, any benchmarks that could measure these overheads?
Also, this can be mitigated by just exposing debug_ranges to lld. The
various ranges created for each basic block are contiguous as most
basic blocks end up together and these ranges can be collapsed by the
linker into a larger range.

James Y Knight via llvm-dev

unread,

Oct 11, 2019, 1:46:24 PM10/11/19

to Rafael Auler, llvm...@lists.llvm.org

Is there large value from deferring the block ordering to link time? That is, does the block layout algorithm need to consider global layout issues when deciding which blocks to put together and which to relegate to the far-away part of the code?

Or, could the propellor-optimized compile step instead split each function into only 2 pieces -- one containing an "optimally-ordered" set of hot blocks from the function, and another containing the cold blocks? The linker would have less flexibility in placement, but maybe it doesn't actually need that flexibility?

Apologies if this is obvious for those who actually know what they're talking about here. :)

Xinliang David Li via llvm-dev

unread,

Oct 11, 2019, 2:25:39 PM10/11/19

to James Y Knight, llvm...@lists.llvm.org

On Fri, Oct 11, 2019 at 10:46 AM James Y Knight via llvm-dev <llvm...@lists.llvm.org> wrote:

Is there large value from deferring the block ordering to link time? That is, does the block layout algorithm need to consider global layout issues when deciding which blocks to put together and which to relegate to the far-away part of the code?

Or, could the propellor-optimized compile step instead split each function into only 2 pieces -- one containing an "optimally-ordered" set of hot blocks from the function, and another containing the cold blocks? The linker would have less flexibility in placement, but maybe it doesn't actually need that flexibility?

Apologies if this is obvious for those who actually know what they're talking about here. :)

It is a fair question.

We believe the flexibility to do fine grained layout in whole program context is important. PostLinkOptimization is aimed at getting as much performance improvement as possible (usually applied on top of ThinLTO+PGO), so the framework is designed to enable it.

In particular, it allows the linker to stitch hot bb traces from different functions to be stitched together. It also allows hot trace duplication across procedure boundaries (kind of interprocedural tailDup). Besides, code alignment decisions to minimize branch mispredictions may require global context (e.g, too conflicting branches residing in two different functions). Other micro-arch specific optimizations to improve processor front-end throughput may also require global context.

It is conceivable to have an option to control the level of granularity at the possible cost of performance.

thanks,

David

Sriraman Tallam via llvm-dev

unread,

Oct 14, 2019, 2:43:49 PM10/14/19

to Xinliang David Li, llvm...@lists.llvm.org

Hello,

I wanted to consolidate all the discussions and our final thoughts on the concerns raised. I have attached a document consolidating it.

BOLT’s performance gains inspired this work and we believe BOLT

is a great piece of engineering. However, there are build environments where
scalability is critical and memory limits per process are tight :

* Debug Fission, https://gcc.gnu.org/wiki/DebugFission was primarily
invented to achieve scalability and better incremental build times while
building large binaries with debug information.

* ThinLTO,
http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html was
primarily invented to make LLVM’s full LTO scalable and keep the memory and
time overheads low. ThinLTO has enabled much broader adoption of whole
program optimization, by making it non-monolithic.

* For Chromium builds,
https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolcha
in/concurrent_links.gni, the linker process memory is set to 10GB with ThinLTO.
It was 26GB with Full LTO before that and individual processes will run of out
of memory beyond that.

* Here,
https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistrib
utedBuildSystemAtGoogleScale.pdf, a distributed build system at Google scale
is shown where 5 million binary and test builds are performed every day on
several thousands of machines, each with a limitation of 12G of memory per
process and 15 minute time-out on tests. Memory overheads of 35G (clang) are
well above these thresholds.

We have developed Propeller like ThinLTO that can be used to obtain similar
performance gains like BOLT in such environments.

Thanks

Sri

concerns.txt

Maksim Panchenko via llvm-dev

unread,

Oct 16, 2019, 6:53:01 PM10/16/19

to Sriraman Tallam, Xinliang David Li, llvm...@lists.llvm.org

Hi Sri,

I want to clarify one thing before sending a detailed reply: did you evaluate

BOLT on Clang built with basic block sections? In the makefile you reference,

there are two versions: a “vanilla” and a default built with function sections.

High overheads you see with BOLT on Clang do not match our experience.

Thanks,

Maksim

Sriraman Tallam via llvm-dev

unread,

Oct 17, 2019, 5:59:16 PM10/17/19

to Maksim Panchenko, llvm...@lists.llvm.org

Hello Maksim,

On Wed, Oct 16, 2019 at 3:52 PM Maksim Panchenko <ma...@fb.com> wrote:

Hi Sri,

I want to clarify one thing before sending a detailed reply: did you evaluate

BOLT on Clang built with basic block sections?

In the makefile you reference,

there are two versions: a “vanilla” and a default built with function sections.

High overheads you see with BOLT on Clang do not match our experience.

Thanks for pointing that out in the Makefile. We double-checked and noticed a bug in our Makefile. For clang, we noticed that we are BOLTING with basic block symbols which seems to affect the memory consumption of BOLT. So, we have re-measured with recent bolt and for *full transparency* we have uploaded the binaries, BOLT's yaml files and perf.data files and the commands so that you can independently verify our results and check the binaries. We have gzipped all the files to keep it under 2G limit for git lfs, everything is here : https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment We have run our experiments on a 192G machine with Intel 18 core.

We built llvm-bolt with most recent sources and is *pristine* with none of our patches and uploading the binary we used here, https://github.com/google/llvm-propeller/blob/plo-dev/clang-bolt-experiment/llvm-bolt That's a very recent BOLT binary, git hash: 988a7e8819b882fd14e18d149f8d3f702b134680

The https://github.com/google/llvm-propeller/tree/plo-dev/clang-bolt-experiment/{v1,v2} contains two sets of binaries. The first binary is pristine recent clang-10 built from 2 weeks ago with additionally only -Wl,-q. v2 is another clang binary also only additionally built with -q. There are no labels or sections or any other Propeller flags used to build these clang binaries. Here is the command we are using to optimize with BOLT, all the commands have been checked in too.

You should be able to run llvm-bolt now on these binaries as all the files are provided. We have also provided the raw perf data files in case you want to independently convert.

$ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1 --update-debug-sections

For version 2, this is the number:

Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40

Maximum resident set size (kbytes): 18742688

That is 125 seconds and ~18G of RAM.

For version 1, this hangs and we stopped it after several minutes and the maximum RSS size crossing 50G. This is most likely just a bug and you should be able to reproduce this. The binary seems to be running and printing update messages.

We also measured without update-debug-sections too with the command :

$ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1

For version1 :

Elapsed (wall clock) time (h:mm:ss or m:ss): 1:33.74

Maximum resident set size (kbytes): 14824444

93 seconds and ~14G of RAM

version 2 :

Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.90

Maximum resident set size (kbytes): 14511912

similar 91 secs and ~14G

Now, coming back to the bug in the Makefile, we originally reported ~35G. That is *wrong* since the clang binary used to measure bolt overheads was built with basic block labels. Our *sincere apologies* for this, this showed BOLT as consuming more memory than is actual for clang. We double-checked BOLT numbers with the internal benchmark search2 for sanity and that is built *without any labels* and only with "-Wl,-q". We are checking the other large internal benchmarks too. We cannot disclose internal benchmarks. So, we will get more large open-source benchmarks like Chrome or gcc built with bolt and share the binaries and results so you can independently verify.

Thanks

Sri

Maksim Panchenko via llvm-dev

unread,

Oct 18, 2019, 1:57:10 PM10/18/19

to Sriraman Tallam, llvm...@lists.llvm.org

Cool. The new numbers look good. If you run BOLT with jemalloc library

preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed that

compared to the default malloc, it improves the multithreaded

performance and brings down memory usage significantly.

Thanks,

Maksim

Sriraman Tallam via llvm-dev

unread,

Oct 18, 2019, 2:21:15 PM10/18/19

to Maksim Panchenko, llvm...@lists.llvm.org

Hello Maksim,

On Fri, Oct 18, 2019 at 10:57 AM Maksim Panchenko <ma...@fb.com> wrote:

Cool. The new numbers look good. If you run BOLT with jemalloc library

preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed that

compared to the default malloc, it improves the multithreaded

performance and brings down memory usage significantly.

Great, thanks for confirming! Would you be willing to share specific numbers, how significant is the reduction in memory with jemalloc for clang? We double-checked our numbers with the larger benchmarks and we can confirm they were *not built with labels*. One of our large benchmarks, search1, is about 5 times the size of clang in terms of text size as reported by size command, and we are seeing a 70G memory overhead on this. Do you have data on the memory consumption of BOLT with larger benchmarks with jemalloc. We are trying to build Chrome with latest BOLT so that we can share the memory overheads and the binaries with you for transparency but we are struggling with the disassembly errors. If you have data on large benchmarks we would appreciate it if you could share it.

Further, if you have a recipe to use jemalloc with BOLT, please point it at us. We could try it out too and share our findings.

Thanks much,

Sri

Maksim Panchenko via llvm-dev

unread,

Oct 22, 2019, 1:07:58 AM10/22/19

to Sriraman Tallam, llvm...@lists.llvm.org

Hi Sri,

Thank you for replying to our feedback. 7 out 12 high-level concerns have been

answered; 2 of them are fully addressed. The rest are being tracked at the

following Google doc:

https://docs.google.com/document/d/1jqjUZc8sEoNl6_lrVD6ZkITyFBFqhUOZ6ZaDm_XVbb8/

To keep the discussion at a high level, I have to reference the LTO vs ThinLTO

comparison since that appears to be the central theme in your response to the

feedback.

Unlike LTO, BOLT does not have to keep the entire program in memory.

Furthermore, as we have previously mentioned, most of the passes are run in

parallel, and the performance scales well with the number of CPUs.

To demonstrate that running BOLT on a hot subset of functions is not just a

speculation, we have prototyped a "Thin" version that optimizes Clang-7 in under

15 seconds using less than 4GB of memory. No modifications to the linker or

compiler were required. And by the way, that appears to be faster than just the

re-linking phase of the Propeller. Larger loads show similar improvements

providing 2x-5x savings over the original processing specs.

Let me reiterate that current BOLT requires large amounts of memory not because

it's a fundamental limitation, unlike LTO. For us, system memory was never a

constraint. The runtime of the application, not BOLT, was the primary goal

during the development.

ThinLTO design solves a real problem and dramatically improves compilation time

even when building on a single node. ThinLTO results provide "end-to-end build

time" comparison to LTO. I've asked you to show a similar comparison for

Propeller vs. BOLT. I haven't seen the results, and I suspect the total overhead

will exceed that of even the oldest non-parallel version of BOLT.

One argument I've heard is that BOLT is not taking advantage of the distributed

build system. That's correct. It does not have to since it does not require to

rebuild the application. In "Thin" mode, the overhead is similar to a regular

linker running with a linker script.

You are right that we do not support debug fission packages. It is unimplemented

for a simple reason: no one asked for it previously. And as we like to say in

the open-source community: "patches are welcome."

Maksim

P.S. We have updated https://github.com/facebookincubator/BOLT with instructions on running BOLT with jemalloc or tcmalloc.

Sriraman Tallam via llvm-dev

unread,

Oct 22, 2019, 1:39:35 AM10/22/19

to Maksim Panchenko, llvm...@lists.llvm.org

We are going to be at the llvm-dev meeting the next two days. We will get back to you after that.

Sri

Xinliang David Li via llvm-dev

unread,

Oct 22, 2019, 3:04:09 AM10/22/19

to Maksim Panchenko, llvm...@lists.llvm.org

Hi Maksim, see my reply inline below.

On Mon, Oct 21, 2019 at 10:07 PM Maksim Panchenko <ma...@fb.com> wrote:

Hi Sri,

Thank you for replying to our feedback. 7 out 12 high-level concerns have been

answered; 2 of them are fully addressed. The rest are being tracked at the

following Google doc:

https://docs.google.com/document/d/1jqjUZc8sEoNl6_lrVD6ZkITyFBFqhUOZ6ZaDm_XVbb8/

To keep the discussion at a high level, I have to reference the LTO vs ThinLTO

comparison since that appears to be the central theme in your response to the

feedback.

Unlike LTO, BOLT does not have to keep the entire program in memory.

IMO, there is no fundamental difference there. Partial (non Whole Program) LTO does not need to put everything in memory either, but it is still MonoLTO, and does not make it ThinLTO.

Like ThinLTO, Propeller is designed with the following principles in mind:

1) The serial step of Propeller (either an integrated component of linker or a plugin to it) is made as thin as possible -- it works on lightweight summary like data

2) The serial step makes global decisions and do not make transformations

3) Code transformations are jobs of compiler backend and linker (reordering etc) and there will be clear interface between Propeller and the transformers (longer term with persistent McInst support)

4) Propeller is designed to be turned on for tens of thousands of build targets, in other words, it should be turned on for *everyone* by default. This means the memory footprint for the optimization much be reasonably small. It should integrate easily with distributed build systems.

5) Related to goal 4) there, disassembling should be avoid as much as possible. We have experienced all kinds of problems with it, and it can be a constant source of pain (not to mention the memory overhead).

Furthermore, as we have previously mentioned, most of the passes are run in

parallel, and the performance scales well with the number of CPUs.

If a pass can be run in parallel (e.g, function level), then PLO may be the wrong place to do such optimizations.

To demonstrate that running BOLT on a hot subset of functions is not just a

speculation, we have prototyped a "Thin" version that optimizes Clang-7 in under

15 seconds using less than 4GB of memory. No modifications to the linker or

compiler were required. And by the way, that appears to be faster than just the

re-linking phase of the Propeller. Larger loads show similar improvements

providing 2x-5x savings over the original processing specs.

If you can provide a patch so that we can try it with large benchmarks, that will be great.

On the other hand, Clang's text is relatively small (<100M?), so using 4GB memory is still quite large IMO. Sri and team tried to bring up BOLT for Chrome (which is about 2 or 3x larger than Clang) for the last two days, but have not been successful. They encountered all sorts of problems --- after fixing many sources that led to disassembly errors, they got repeatedly hit by 'jump into the middle of insn' errors. See also 5) above. (Similarly, a few months time were spent on bringing up BOLT for an internal benchmark which is 8x larger than Clang)

Let me reiterate that current BOLT requires large amounts of memory not because

it's a fundamental limitation, unlike LTO. For us, system memory was never a

constraint. The runtime of the application, not BOLT, was the primary goal

during the development.

BOLT is an amazing engineering pierce that inspired Propeller. See the objectives above, system memory is one of the primary design goals for it , and it is architected to enforce the policy.

ThinLTO design solves a real problem and dramatically improves compilation time

even when building on a single node. ThinLTO results provide "end-to-end build

time" comparison to LTO. I've asked you to show a similar comparison for

Propeller vs. BOLT. I haven't seen the results, and I suspect the total overhead

will exceed that of even the oldest non-parallel version of BOLT.

I have explained this above and we are confident that Propeller can be turned on by default (out of box) for any targets just like ThinLTO today (and this is indeed the goal). We will continue to iterate on later versions to make it happen.

One argument I've heard is that BOLT is not taking advantage of the distributed

build system. That's correct. It does not have to since it does not require to

rebuild the application. In "Thin" mode, the overhead is similar to a regular

linker running with a linker script.

4GB in Thin Mode for a Clang sized program is not the scalability goal to shoot for wider adoption of PLO.

thanks,

David

Sriraman Tallam via llvm-dev

unread,

Jan 17, 2020, 7:19:29 PM1/17/20

to llvm-dev

We have been working on addressing most of the concerns raised and we
have updated the patches with the following changes.

1) Exception Handling

TLDR; We now handle exceptions and can optimize exception heavy
benchmarks and clang with exceptions enabled.

We have updated Propeller patches to handle exceptions. In contrast to
the previous approach which grouped all exception-handling related
basic blocks into one section, we now group just the landing pads
within one section and create a separate section for every other basic
block (even those which potentially throw exceptions by calling other
functions). This allows us to reorder code for exception-heavy
programs with greater flexibility. Grouping all landing pad basic
blocks in the same section lets us satisfy the ABI requirement of “one
landing pad fragment per procedure fragment.” This could be further
relaxed to separately grouping together all the landing pads which are
used by every basic block section. We will leave this for future
updates.

With this change, we are able to speedup the SPEC benchmark xalancbmk
by 0.5% which we were unable to do so previously. (Without exception
handling support this benchmark regressed by 1%).

We also built clang with exceptions enabled (DLLVM_ENABLE_EH=ON) and
were able to get performance improvements similar to what we got
without exceptions.

2) Minimizing the number of basic block sections

TLDR; We have significantly cut down the number of basic blocks for
which sections are created.

Previously, basic block sections were created for all the basic
blocks in a function that had samples. We have modified the patches
to allow creation of sections for only those basic blocks that have
samples. This greatly reduces the number of sections created. This
does not affect benchmark performance in any manner. For clang,
making this change reduces the object size bloat from 100% (2x) to 50%
(1.5x). Clang is an extreme benchmark for object size bloats since the
perf samples touch 10% of all functions. For larger datacenter
workload benchmarks we evaluated, the object size bloats are less than
5% with this change.

We are working on further reducing this so that the object file size
bloats can be kept as minimal as possible. Currently, we create a
section for a basic block even if it received just one perf sample.
This is overkill and can be restricted to only basic blocks that were
really hot.

3) Interprocedural Basic Block Reordering

TLDR; Interprocedural basic block reordering further improves
performance by 1% (clang benchmark) but is slower and we are working
on a better implementation.

Previously, the function reordering done was intra-procedural and one
of the comments was that if this could be done inter-procedurally. We
have now added an option to do this inter-procedurally, it is not the
default. While the inter-procedural reordering algorithm is slower by
10x (increasing from 3 seconds to 30 seconds for clang) because the
code layout search space is much larger, the performance of the final
benchmark improves by an additional 1%. We are working on improving
the implementation to make this much faster.

4) Propeller functionality in lld is separated out

After discussing with Rui, we have moved Propeller code into a
separate directory. Propeller interfaces with lld through
well-defined API. This provides the advantage that Propeller code
itself and future changes can be reviewed independently.

5) Reusing optimized LLVM IR files to re-link

TLDR; We are working on re-using native object files from previous
link and patches will be available soon

The final re-link with Propeller can directly start from the high
level optimized IR files. We are working on a patch to support
options “--save-precodegen” which will automatically save optimized IR
files which will later be reused by Propeller to reduce link time,
Branch : https://github.com/google/llvm-propeller/tree/plo-codegen-backend.

We are also working on directly using native object files from the
previous link for those objects which did not accrue any profiling
samples. Even for the clang benchmark, only 7% of the objects needed
to be re-generated with basic block sections and the cached native
object files can be used for the rest. This approach works
particularly well with ThinLTO as the lto.cache can directly be used
to obtain the native object files for modules that are not affected.
We will send out patches for this soon.

6) Time Overheads of CFI and Debug Info

TLDR; Experiments show varying overheads with Debug Info depending
on the symbolizers. We did not see any overheads with CFI.
Collapsing .debug_ranges in the linker solves the overheads associated
with this.

We tried to measure two things. Overheads from accessing .eh_frame
and .debug_ranges. With Basic block sections, a new CFI FDE is
created for every basic block that gets a new section and a new range
pair is added to debug_ranges. We wrote a benchmark , using
libunwind, to dump the stack trace (no symbolization) and then do then
dump the stack trace running the symbolizer (symbolization) using a
Google Internal API.

The benchmark had ~12000 functions and basic block sections for all
basic blocks resulted in ~130000 more basic block sections and
equivalent debug range entries, an order of magnitude more.

For a vanilla binary, on average over several iterations, it took
~70ns to dump the stack trace without symbolization and ~235000 ns
with symbolization. We then enabled basic block sections across the
board and re-measured and found no measurable impact. Symbolization
seems to dominate the time.

We then measured overheads with addr2line and llvm-symbolizer. We
just timed symbolizing a couple of addresses. With addr2line, the
binary with sections slows down by 5% to 10%. With llvm-symbolizer,
the slowdown is much more, by upto 30% sometimes. Looking at the
symbolizer code, we found places where the symbols are scanned
linearly and also places where the .debug_ranges are not collapsed.
Looking at the different symbolizer implementations, we found that
some use binary search, some collapse debug ranges as and when they
are added and some just do a linear search.

The primary reason for this overhead is that the .debug_ranges are
much larger. We are looking at improving this by doing one of these:

a) Have the linker collapse the debug_ranges which are pretty much
contiguous as most of the basic blocks end up in the same function.
This is the easier solution and we are working on a patch. This should
completely solve the problem.
b) Alternately, modify the symbolizers to collapse the ranges as and
when they are added.

All our experiments were with full basic block sections. When basic
block sections are turned on selectively, these problems should not
manifest even now.

7) New Relocation types to simplify linker relaxation of Basic Block
Section Jumps

We are working on adding two new relocation types: R_X86_64_PC32_JUMPX
and R_X86_64_PC8_JUMPX to the x86-64 ps-ABI to make linker handing of
relaxation simplified. Specifically, these two relocation types will
allow the linker to easily identify the jump instructions that were
created for bb sections. The PC8 relocations will only have to be
checked for growing to 32 bits during relaxation.

Proposal here: https://groups.google.com/forum/?hl=en#!topic/x86-64-abi/-2AkYw1QJuI

Thanks,
Sri

Fangrui Song via llvm-dev

unread,

Feb 7, 2020, 2:06:23 PM2/7/20

to llvm-dev

>On Mon, Oct 21, 2019 at 10:38 PM Sriraman Tallam <tmsr...@google.com> wrote:
>>
>> We are going to be at the llvm-dev meeting the next two days. We will get back to you after that.
>>
>> Sri
>>
>> On Mon, Oct 21, 2019 at 10:07 PM Maksim Panchenko <ma...@fb.com> wrote:

I took some time to understand the lld side changes
(D68065+D73497+D68073; the patch do not have clear relations and do not
apply at HEAD cleanly), attended a LLVM Social yesterday and asked some
people about their feelings. We are still dubious whether doing all the
heavy lifting on the linker side is the right approach. Disassembly at
linker side may give some short term benefits, but the interaction with
debug info/asynchronous unwind tables/symbol information is muddy. (From
the existing posted patch series it is not clear how they are solved.)

Doing non-LTO things on the linkder side is rather inflexible.
Linkers don't really understand the semantics. So I see things like:

* Get the input section size, check if there is a relocation (for JMP)
r_offset = input_section_size-4, if there is a JCC at r_offset =
input_section_size-9, ...
* Update st_value to take shrunk input sections into account.
* An algorithm to the extended travelling salesman problem looks fancy.
It also introduced a bunch of parameters. Have other ordering approaches been considered?
I am concerned that the raw ELF fields (like sh_size) are used for
heuristics. It seems to try reconstructing some information about
Machine{Function,BasicBlock,Instr,...} about doing that in the
Machine* stage may just be simpler and more accurate.

We feel that the available optimizations may be limited. The lld focused
implementation may have some improvement in the near term but the
benefits will dilute quickly when the Machine{Function,BasicBlock}
interfaces get improved.

Also, doing heavy-lifting work on the linker side:

Needs non-trivial work to port to another binary format. Another target
may be a very different story. (Even for x86-64, I see the patches
attend to address R_X86_64_PC8/R_X86_64_PC32/R_X86_64_PLT32 calls, but
the two relocation types are not handled.)
The newly lld/ELF code overlaps with jobs of MachineBlockPlacement/BranchFolding/AsmPrinter/etc.

If the current MachineFunction interface is not flexible (e.g. it does
not allow sharable MachineBasicBlocks between MachineFunctions.) enough,
let's fix it.

Rahman Lavaee via llvm-dev

unread,

Feb 10, 2020, 2:14:16 PM2/10/20

to Fangrui Song, llvm-dev

I respond with regards to "unwind tables" and "code layout".

WIth regards to "unwind tables", the idea of splitting call-sites tables has been implemented in D73739.

I think "asynchronous unwind tables" are already supported by our patch (Meaning that the unwind tables are precise at instruction boundary).

Please let us know if you believe otherwise.

* An algorithm to the extended travelling salesman problem looks fancy.
It also introduced a bunch of parameters. Have other ordering approaches been considered?

The Extended-TSP algorithm is published in https://arxiv.org/abs/1809.04676 and shows a speedup improvement of about 0.7% compared to TSP (on Clang), which I confirmed independently.

The general idea of considering jump distance in the layout decision is ubiquitous (https://dl.acm.org/doi/10.1145/3302516.3307358 (codestitcher), https://dl.acm.org/doi/10.5555/3049832.3049858 (hfsort) ).

We have considered other ordering approaches such as pettis-hansen and codestitcher (both of which are based on TSP) and we have found Ext-TSP to be the best.

You can also refer to the codestitcher and ext-TSP paper for more extensive study.

Let me point out that although the formulation of ext-TSP is more involved than TSP, the heuristic algorithm that is used to solve it is essentially the same as the one for TSP. It's an incremental changing algorithm aimed at maximizing the gain in the score at each step.

On the other hand, most of the complication in the algorithm is caused by split-and-remerge (guided by -propeller-chain-split-threshold) which is highly beneficial for escaping local optima, resulting in 2% more performance improvement. The original ext-TSP work sets the split threshold at 128 bytes, while we can achieve better speedup by setting it to 1024bytes, without sacrificing the complication time (The whole reordering algorithm takes about 4 seconds on clang).

I am concerned that the raw ELF fields (like sh_size) are used for
heuristics. It seems to try reconstructing some information about
Machine{Function,BasicBlock,Instr,...} about doing that in the
Machine* stage may just be simpler and more accurate.

Yes. Basic block (binary) sizes are used to compute the binary distance for jumps in the final layout (as used by the ext-TSP algorithm). We can currently get this information both from the object code (sections), or the BB symbol sizes (propagated through the propeller profile). Getting this information in the machine stage is not as accurate and convenient. For instance, StackMapShadowTracker has been implemented in llvm to count the instruction bytes after the last StackMap and it shares some code between different classes (X86AsmPrinter and X86MCInstLower).

Sriraman Tallam via llvm-dev

unread,

Feb 11, 2020, 1:32:09 PM2/11/20

to Fangrui Song, llvm-dev

I will specifically answer the relocations part as Rahman has
addressed the other things. This was discussed previously and Eli and
Rui suggested we use specific relocations for jmp instructions which
can be relaxed. We are working on that one here:
https://groups.google.com/g/x86-64-abi/c/-2AkYw1QJuI With the
relocations added, the relaxation code would become a lot simpler.

Reply all

Reply to author

Forward